Tivoli Workload Scheduler LoadLeveler




Using and Administering
Version 3 Release 5




                                        SA22-7881-08
Note
  Before using this information and the product it supports, read the information in “Notices” on page 745.




Ninth Edition (November 2008)
This edition applies to version 3, release 5, modification 0 of IBM Tivoli Workload Scheduler LoadLeveler (product
numbers 5765-E69 and 5724-I23) and to all subsequent releases and modifications until otherwise indicated in new
editions. This edition replaces SA22-7881-07. Significant changes or additions to the text and illustrations are
indicated by a vertical line (|) to the left of the change.
IBM welcomes your comments. A form for readers’ comments may be provided at the back of this publication, or
you can send your comments to the address:
   International Business Machines Corporation
   Department 58HA, Mail Station P181
   2455 South Road
   Poughkeepsie, NY 12601-5400
   United States of America

   FAX (United States & Canada): 1+845+432-9405
   FAX (Other Countries):
     Your International Access Code +1+845+432-9405

   IBMLink™ (United States customers only): IBMUSM10(MHVRCFS)
   Internet e-mail: mhvrcfs@us.ibm.com
If you want a reply, be sure to include your name, address, and telephone or FAX number.
Make sure to include the following in your comment or note:
v Title and order number of this publication
v Page number or topic related to your comment
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any
way it believes appropriate without incurring any obligation to you.
©Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team.
©Copyright International Business Machines Corporation 1986, 2008. All rights reserved. US Government Users
Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Figures . . . ix
Tables . . . xi

About this information . . . xiii
  Who should use this information . . . xiii
  Conventions and terminology used in this information . . . xiii
  Prerequisite and related information . . . xiv
  How to send your comments . . . xv

Summary of changes . . . xvii

Part 1. Overview of TWS LoadLeveler concepts and operation . . . 1

Chapter 1. What is LoadLeveler? . . . 3
  LoadLeveler basics . . . 4
  LoadLeveler: A network job management and scheduling system . . . 4
    Job definition . . . 5
    Machine definition . . . 6
  How LoadLeveler schedules jobs . . . 7
  How LoadLeveler daemons process jobs . . . 8
    The master daemon . . . 9
    The Schedd daemon . . . 10
    The startd daemon . . . 11
    The negotiator daemon . . . 13
    The kbdd daemon . . . 14
    The gsmonitor daemon . . . 14
  The LoadLeveler job cycle . . . 16
    LoadLeveler job states . . . 19
  Consumable resources . . . 22
    Consumable resources and AIX Workload Manager . . . 24
  Overview of reservations . . . 25
  Fair share scheduling overview . . . 27

Chapter 2. Getting a quick start using the default configuration . . . 29
  What you need to know before you begin . . . 29
  Using the default configuration files . . . 29
  LoadLeveler for Linux quick start . . . 30
    Quick installation . . . 30
    Quick configuration . . . 30
    Quick verification . . . 30
  Post-installation considerations . . . 31
    Starting LoadLeveler . . . 31
    Location of directories following installation . . . 32

Chapter 3. What operating systems are supported by LoadLeveler? . . . 35
  LoadLeveler for AIX and LoadLeveler for Linux compatibility . . . 35
    Restrictions for LoadLeveler for Linux . . . 36
    Features not supported in LoadLeveler for Linux . . . 36
    Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters . . . 37

Part 2. Configuring and managing the TWS LoadLeveler environment . . . 39

Chapter 4. Configuring the LoadLeveler environment . . . 41
  Modifying a configuration file . . . 42
  Defining LoadLeveler administrators . . . 43
  Defining a LoadLeveler cluster . . . 44
    Choosing a scheduler . . . 44
    Setting negotiator characteristics and policies . . . 45
    Specifying alternate central managers . . . 46
    Defining network characteristics . . . 47
    Specifying file and directory locations . . . 47
    Configuring recording activity and log files . . . 48
    Setting up file system monitoring . . . 54
  Defining LoadLeveler machine characteristics . . . 54
    Defining job classes that a LoadLeveler machine will accept . . . 55
    Specifying how many jobs a machine can run . . . 55
  Defining security mechanisms . . . 56
    Configuring LoadLeveler to use cluster security services . . . 57
  Defining usage policies for consumable resources . . . 60
  Enabling support for bulk data transfer and rCxt blocks . . . 61
  Gathering job accounting data . . . 61
    Collecting job resource data on serial and parallel jobs . . . 62
|   Collecting accounting information for recurring jobs . . . 63
    Collecting accounting data for reservations . . . 63
    Collecting job resource data based on machines . . . 64
    Collecting job resource data based on events . . . 64
    Collecting job resource information based on user accounts . . . 65
    Collecting the accounting information and storing it into files . . . 66
    Producing accounting reports . . . 66
    Correlating AIX and LoadLeveler accounting records . . . 66
    64-bit support for accounting functions . . . 67
    Example: Setting up job accounting files . . . 67
  Managing job status through control expressions . . . 68
    How control expressions affect jobs . . . 69
  Tracking job processes . . . 70
  Querying multiple LoadLeveler clusters . . . 71
  Handling switch-table errors . . . 72
  Providing additional job-processing controls through installation exits . . . 72
    Controlling the central manager scheduling cycle . . . 73
    Handling DCE security credentials . . . 74
    Handling an AFS token . . . 75
    Filtering a job script . . . 76
    Writing prolog and epilog programs . . . 77
    Using your own mail program . . . 81

Chapter 5. Defining LoadLeveler resources to administer . . . 83
  Steps for modifying an administration file . . . 83
  Defining machines . . . 84
    Planning considerations for defining machines . . . 85
    Machine stanza format and keyword summary . . . 86
    Examples: Machine stanzas . . . 86
  Defining adapters . . . 86
    Configuring dynamic adapters . . . 87
    Configuring InfiniBand adapters . . . 87
    Adapter stanza format and keyword summary . . . 88
    Examples: Adapter stanzas . . . 89
  Defining classes . . . 89
    Using limit keywords . . . 89
    Allowing users to use a class . . . 92
    Class stanza format and keyword summary . . . 92
    Examples: Class stanzas . . . 93
  Defining user substanzas in class stanzas . . . 94
    Examples: Substanzas . . . 95
  Defining users . . . 97
    User stanza format and keyword summary . . . 97
    Examples: User stanzas . . . 98
  Defining groups . . . 99
    Group stanza format and keyword summary . . . 99
    Examples: Group stanzas . . . 99
  Defining clusters . . . 100
    Cluster stanza format and keyword summary . . . 100
    Examples: Cluster stanzas . . . 100

Chapter 6. Performing additional administrator tasks . . . 103
  Setting up the environment for parallel jobs . . . 104
    Scheduling considerations for parallel jobs . . . 104
    Steps for reducing job launch overhead for parallel jobs . . . 105
    Steps for allowing users to submit interactive POE jobs . . . 106
    Setting up a class for parallel jobs . . . 106
|   Striping when some networks fail . . . 107
    Setting up a parallel master node . . . 108
    Configuring LoadLeveler to support MPICH jobs . . . 108
    Configuring LoadLeveler to support MVAPICH jobs . . . 108
    Configuring LoadLeveler to support MPICH-GM jobs . . . 109
  Using the BACKFILL scheduler . . . 110
    Tips for using the BACKFILL scheduler . . . 112
    Example: BACKFILL scheduling . . . 113
| Data staging . . . 113
|   Configuring LoadLeveler to support data staging . . . 114
  Using an external scheduler . . . 115
    Replacing the default LoadLeveler scheduling algorithm with an external scheduler . . . 116
    Customizing the configuration file to define an external scheduler . . . 118
    Steps for getting information about the LoadLeveler cluster, its machines, and jobs . . . 118
    Assigning resources and dispatching jobs . . . 122
  Example: Changing scheduler types . . . 126
  Preempting and resuming jobs . . . 126
    Overview of preemption . . . 127
    Planning to preempt jobs . . . 128
    Steps for configuring a scheduler to preempt jobs . . . 130
  Configuring LoadLeveler to support reservations . . . 131
    Steps for configuring reservations in a LoadLeveler cluster . . . 132
  Steps for integrating LoadLeveler with the AIX Workload Manager . . . 137
  LoadLeveler support for checkpointing jobs . . . 139
    Checkpoint keyword summary . . . 139
    Planning considerations for checkpointing jobs . . . 140
    AIX checkpoint and restart limitations . . . 141
    Naming checkpoint files and directories . . . 145
    Removing old checkpoint files . . . 146
  LoadLeveler scheduling affinity support . . . 146
    Configuring LoadLeveler to use scheduling affinity . . . 147
  LoadLeveler multicluster support . . . 148
    Configuring a LoadLeveler multicluster . . . 150
|   Scale-across scheduling with multiclusters . . . 153
  LoadLeveler Blue Gene support . . . 155
    Configuring LoadLeveler Blue Gene support . . . 157
    Blue Gene reservation support . . . 159
    Blue Gene fair share scheduling support . . . 159
    Blue Gene heterogeneous memory support . . . 160
    Blue Gene preemption support . . . 160
    Blue Gene/L HTC partition support . . . 160
  Using fair share scheduling . . . 160
    Fair share scheduling keywords . . . 161
    Reconfiguring fair share scheduling keywords . . . 163
    Example: three groups share a LoadLeveler cluster . . . 164
    Example: two thousand students share a LoadLeveler cluster . . . 165
    Querying information about fair share scheduling . . . 166
    Resetting fair share scheduling . . . 166
    Saving historic data . . . 166
    Restoring saved historic data . . . 167
  Procedure for recovering a job spool . . . 167

Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks . . . 169
  Job-related administrative actions . . . 169
  Machine-related administrative actions . . . 172

Part 3. Submitting and managing TWS LoadLeveler jobs . . . 177

Chapter 8. Building and submitting jobs . . . 179
  Building a job command file . . . 179
    Using multiple steps in a job command file . . . 180
    Examples: Job command files . . . 181
  Editing job command files . . . 185
  Defining resources for a job step . . . 185
| Submitting jobs requesting data staging . . . 186
  Working with coscheduled job steps . . . 187
    Submitting coscheduled job steps . . . 187
    Determining priority for coscheduled job steps . . . 187
    Supporting preemption of coscheduled job steps . . . 187
    Coscheduled job steps and commands and APIs . . . 188
    Termination of coscheduled steps . . . 188
  Using bulk data transfer . . . 188
  Preparing a job for checkpoint/restart . . . 190
  Preparing a job for preemption . . . 193
  Submitting a job command file . . . 193
    Submitting a job using a submit-only machine . . . 194
  Working with parallel jobs . . . 194
    Step for controlling whether LoadLeveler copies environment variables to all executing nodes . . . 195
    Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software . . . 195
    Task-assignment considerations . . . 196
    Submitting jobs that use striping . . . 198
    Running interactive POE jobs . . . 203
    Running MPICH, MVAPICH, and MPICH-GM jobs . . . 204
    Examples: Building parallel job command files . . . 207
    Obtaining status of parallel jobs . . . 212
    Obtaining allocated host names . . . 212
  Working with reservations . . . 213
    Understanding the reservation life cycle . . . 214
    Creating new reservations . . . 216
    Submitting jobs to run under a reservation . . . 218
    Removing bound jobs from the reservation . . . 220
    Querying existing reservations . . . 221
    Modifying existing reservations . . . 221
    Canceling existing reservations . . . 222
  Submitting jobs requesting scheduling affinity . . . 222
  Submitting and monitoring jobs in a LoadLeveler multicluster . . . 223
    Steps for submitting jobs in a LoadLeveler multicluster environment . . . 224
  Submitting and monitoring Blue Gene jobs . . . 226

Chapter 9. Managing submitted jobs . . . 229
  Querying the status of a job . . . 229
  Working with machines . . . 230
  Displaying currently available resources . . . 230
  Setting and changing the priority of a job . . . 230
    Example: How does a job’s priority affect dispatching order? . . . 231
  Placing and releasing a hold on a job . . . 232
  Canceling a job . . . 232
  Checkpointing a job . . . 232

Chapter 10. Example: Using commands to build, submit, and manage jobs . . . 235

Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs . . . 237
  Building jobs . . . 237
  Editing the job command file . . . 249
  Submitting a job command file . . . 250
  Displaying and refreshing job status . . . 251
  Sorting the Jobs window . . . 252
  Changing the priority of your jobs . . . 253
  Placing a job on hold . . . 253
  Releasing the hold on a job . . . 253
  Canceling a job . . . 254
  Modifying consumable resources and other job attributes . . . 254
  Taking a checkpoint . . . 254
  Adding a job to a reservation . . . 255
  Removing a job from a reservation . . . 255
  Displaying and refreshing machine status . . . 255
  Sorting the Machines window . . . 257
  Finding the location of the central manager . . . 257
  Finding the location of the public scheduling machines . . . 258
  Finding the type of scheduler in use . . . 258
  Specifying which jobs appear in the Jobs window . . . 258
  Specifying which machines appear in Machines window . . . 259
  Saving LoadLeveler messages in a file . . . 259

Part 4. TWS LoadLeveler interfaces reference . . . 261

Chapter 12. Configuration file reference . . . 263
  Configuration file syntax . . . 263
    Numerical and alphabetical constants . . . 264
    Mathematical operators . . . 264
    64-bit support for configuration file keywords and expressions . . . 264
  Configuration file keyword descriptions . . . 265
  User-defined keywords . . . 313
  LoadLeveler variables . . . 314
    Variables to use for setting dates . . . 319
    Variables to use for setting times . . . 320

Chapter 13. Administration file reference . . . 321
  Administration file structure and syntax . . . 321
    Stanza characteristics . . . 323
    Syntax for limit keywords . . . 324
    64-bit support for administration file keywords . . . 325
  Administration file keyword descriptions . . . 327

Chapter 14. Job command file reference . . . 357
  Job command file syntax . . . 357
    Serial job command file . . . 357
    Parallel job command file . . . 358
    Syntax for limit keywords . . . 358
    64-bit support for job command file keywords . . . 358
  Job command file keyword descriptions . . . 359
    Job command file variables . . . 399
    Run-time environment variables . . . 400
    Job command file examples . . . 401

Chapter 15. Graphical user interface (GUI) reference . . . 403
  Starting the GUI . . . 403
    Specifying GUI options . . . 404
    The LoadLeveler main window . . . 404
    Getting help using the GUI . . . 405
    Differences between LoadLeveler’s GUI and other graphical user interfaces . . . 406
    GUI typographic conventions . . . 406
    64-bit support for the GUI . . . 407
  Customizing the GUI . . . 407
    Syntax of an Xloadl file . . . 407
    Modifying windows and buttons . . . 408
    Creating your own pull-down menus . . . 409
    Customizing fields on the Jobs window and the Machines window . . . 409
    Modifying help panels . . . 410

Chapter 16. Commands . . . 411
  llacctmrg - Collect machine history files . . . 413
  llbind - Bind job steps to a reservation . . . 415
  llcancel - Cancel a submitted job . . . 421
  llchres - Change attributes of a reservation . . . 424
  llckpt - Checkpoint a running job step . . . 430
  llclass - Query class information . . . 433
  llclusterauth - Generates public and private keys . . . 438
  llctl - Control LoadLeveler daemons . . . 439
  llextRPD - Extract data from an RSCT peer domain . . . 443
  llfavorjob - Reorder system queue by job . . . 447
  llfavoruser - Reorder system queue by user . . . 449
  llfs - Fair share scheduling queries and operations . . . 450
  llhold - Hold or release a submitted job . . . 454
  llinit - Initialize machines in the LoadLeveler cluster . . . 457
  llmkres - Make a reservation . . . 459
  llmodify - Change attributes of a submitted job step . . . 464
  llmovejob - Move a single idle job from the local cluster to another cluster . . . 470
  llmovespool - Move job records . . . 472
  llpreempt - Preempt a submitted job step . . . 474
  llprio - Change the user priority of submitted job steps . . . 477
  llq - Query job status . . . 479
  llqres - Query a reservation . . . 500
  llrmres - Cancel a reservation . . . 508
  llrunscheduler - Run the central manager’s scheduling algorithm . . . 511
  llstatus - Query machine status . . . 512
  llsubmit - Submit a job . . . 531
  llsummary - Return job resource information for accounting . . . 535

Chapter 17. Application programming interfaces (APIs) . . . 541
  64-bit support for the LoadLeveler APIs . . . 543
    LoadLeveler for AIX APIs . . . 543
    LoadLeveler for Linux APIs . . . 544
  Accounting API . . . 544
    GetHistory subroutine . . . 545
    llacctval user exit . . . 547
  Checkpointing API . . . 548
    ckpt subroutine . . . 549
    ll_ckpt subroutine . . . 550
    ll_init_ckpt subroutine . . . 553
    ll_set_ckpt_callbacks subroutine . . . 555
    ll_unset_ckpt_callbacks subroutine . . . 556
  Configuration API . . . 557
    ll_config_changed subroutine . . . 558
    ll_read_config subroutine . . . 559
  Data access API . . . 560
    Using the data access API . . . 560
    Understanding the LoadLeveler data access object model . . . 561
    Understanding the Blue Gene object model . . . 562
    Understanding the Class object model . . . 562
    Understanding the Cluster object model . . . 563
    Understanding the Fairshare object model . . . 563
    Understanding the Job object model . . . 564
    Understanding the Machine object model . . . 565
    Understanding the MCluster object model . . . 566
    Understanding the Reservations object model . . . 566
    Understanding the Wlmstat object model . . . 567
    ll_deallocate subroutine . . . 568
    ll_free_objs subroutine . . . 569
    ll_get_data subroutine . . . 570
    ll_get_objs subroutine . . . 624
    ll_next_obj subroutine . . . 627
    ll_query subroutine . . . 628
    ll_reset_request subroutine . . . 629
    ll_set_request subroutine . . . 630
    Examples of using the data access API . . . 633
  Error handling API . . . 639
    ll_error subroutine . . . 640
  Fair share scheduling API . . . 641
    ll_fair_share subroutine . . . 642
  Reservation API . . . 643
    ll_bind subroutine . . . 645
    ll_change_reservation subroutine . . . 648
    ll_init_reservation_param subroutine . . . 652
    ll_make_reservation subroutine . . . 653
    ll_remove_reservation subroutine . . . 658
|   ll_remove_reservation_xtnd subroutine . . . 660
  Submit API . . . 663
    llfree_job_info subroutine . . . 664
    llsubmit subroutine . . . 665
    monitor_program user exit . . . 667
  Workload management API . . . 668
    ll_cluster subroutine . . . 669
    ll_cluster_auth subroutine . . . 671
    ll_control subroutine . . . 673
    ll_modify subroutine . . . 677
    ll_move_job subroutine . . . 681
    ll_move_spool subroutine . . . 683
    ll_preempt subroutine . . . 686
    ll_preempt_jobs subroutine . . . 688
    ll_run_scheduler subroutine . . . 691
    ll_start_job_ext subroutine . . . 692
    ll_terminate_job subroutine . . . 696

Appendix A. Troubleshooting LoadLeveler . . . 699
  Frequently asked questions . . . 699
    Why won’t LoadLeveler start? . . . 700
    Why won’t my job run? . . . 700
    Why won’t my parallel job run? . . . 703
    Why won’t my checkpointed job restart? . . . 704
    Why won’t my submit-only job run? . . . 705
    Why won’t my job run on a cluster with both AIX and Linux machines? . . . 705
|   Why won’t my job run when scheduling affinity is enabled on x86 and x86_64 systems? . . . 705
    Why does a job stay in the Pending (or Starting) state? . . . 706
    What happens to running jobs when a machine goes down? . . . 706
    Why won’t my jobs run that were directed to an idle pool? . . . 708
    What happens if the central manager isn’t operating? . . . 708
    How do I recover resources allocated by a Schedd machine? . . . 710
    Why can’t I find a core file on Linux? . . . 710
    Why am I seeing inconsistencies in my llfs output? . . . 711
    Why don’t I see my job when I issue the llq command? . . . 711
    What happens if errors are found in my configuration or administration file? . . . 711
    Other questions . . . 712
  Troubleshooting in a multicluster environment . . . 714
    How do I determine if I am in a multicluster environment? . . . 714
    How do I determine how my multicluster environment is defined and what are the inbound and outbound hosts defined for each cluster? . . . 714
    Why is my multicluster environment not enabled? . . . 714
    How do I find log messages from my multicluster-defined installation exits? . . . 715
    Why won’t my remote job be submitted or moved? . . . 715
    Why did the CLUSTER_REMOTE_JOB_FILTER not update the job with all of the statements I defined? . . . 716
    How do I find my remote job? . . . 716
    Why won’t my remote job run? . . . 717
    Why does llq -X all show no jobs running when there are jobs running? . . . 717
  Troubleshooting in a Blue Gene environment . . . 717
    Why do all of my Blue Gene jobs fail even though llstatus shows that Blue Gene is present? . . . 718
    Why does llstatus show that Blue Gene is absent? . . . 718
    Why did my Blue Gene job fail when the job was submitted to a remote cluster? . . . 718
|   Why does llmkres or llchres return ″Insufficient resources to meet the request″ for a Blue Gene reservation when resources appear to be available? . . . 719
  Helpful hints . . . 719
    Scaling considerations . . . 719
    Hints for running jobs . . . 720
    Hints for using machines . . . 723
    History files and Schedd . . . 724
  Getting help from IBM . . . 724

Appendix B. Sample command output . . . 725
  llclass -l command output listing . . . 725
  llq -l command output listing . . . 727
  llq -l command output listing for a Blue Gene enabled system . . . 729
  llq -l -x command output listing . . . 730
  llstatus -l command output listing . . . 733
  llstatus -l -b command output listing . . . 733
  llstatus -B command output listing . . . 735
  llstatus -P command output listing . . . 736
  llsummary -l -x command output listing . . . 736
  llsummary -l -x command output listing for a Blue Gene-enabled system . . . 738

Appendix C. LoadLeveler port usage . . . 741

Accessibility features for TWS LoadLeveler . . . 743
  Accessibility features . . . 743
  Keyboard navigation . . . 743
  IBM and accessibility . . . 743

Notices . . . 745
  Trademarks . . . 746

Glossary . . . 749

Index . . . 753
Figures

 1. Example of a LoadLeveler cluster . . . 3
 2. LoadLeveler job steps . . . 5
 3. Multiple roles of machines . . . 7
 4. High-level job flow . . . 16
 5. Job is submitted to LoadLeveler . . . 17
 6. LoadLeveler authorizes the job . . . 17
 7. LoadLeveler prepares to run the job . . . 18
 8. LoadLeveler starts the job . . . 18
 9. LoadLeveler completes the job . . . 19
10. How control expressions affect jobs . . . 70
11. Format of a machine stanza . . . 86
12. Format of an adapter stanza . . . 88
13. Format of a class stanza . . . 93
14. Format of a user substanza . . . 95
15. Format of a user stanza . . . 98
16. Format of a group stanza . . . 99
17. Format of a cluster stanza . . . 100
18. Multicluster Example . . . 101
19. Job command file with multiple steps . . . 181
20. Job command file with multiple steps and one executable . . . 181
21. Job command file with varying input statements . . . 182
22. Using LoadLeveler variables in a job command file . . . 183
23. Job command file used as the executable . . . 185
24. Striping over multiple networks . . . 200
25. Striping over a single network . . . 202
26. POE job command file – multiple tasks per node . . . 207
27. POE sample job command file – invoking POE twice . . . 208
28. MPICH job command file - sample 1 . . . 208
29. MPICH job command file - sample 2 . . . 209
30. MPICH-GM job command file - sample 1 . . . 210
31. MPICH-GM job command file - sample 2 . . . 210
32. MVAPICH job command file - sample 1 . . . 211
33. MVAPICH job command file - sample 2 . . . 212
34. Using LOADL_PROCESSOR_LIST in a shell script . . . 213
35. Building a job command file . . . 235
36. LoadLeveler build a job window . . . 238
37. Format of administration file stanzas . . . 322
38. Format of administration file substanzas . . . 322
39. Sample administration file stanzas . . . 322
40. Sample administration file stanza with user substanzas . . . 323
41. Serial job command file . . . 358
42. Main window of the LoadLeveler GUI . . . 405
43. Creating a new pull-down menu . . . 409
44. TWS LoadLeveler Blue Gene object model . . . 562
45. TWS LoadLeveler Class object model . . . 563
46. TWS LoadLeveler Cluster object model . . . 563
47. TWS LoadLeveler Fairshare object model . . . 563
48. TWS LoadLeveler Job object model . . . 565
49. TWS LoadLeveler Machine object model . . . 566
50. TWS LoadLeveler MCluster object model . . . 566
51. TWS LoadLeveler Reservations object model . . . 566
52. TWS LoadLeveler Wlmstat object model . . . 567
53. When the primary central manager is unavailable . . . 709
54. Multiple central managers . . . 709
Tables

  1. Summary of typographic conventions . . . xiv
  2. Major topics in TWS LoadLeveler: Using and Administering . . . 1
  3. Topics in the TWS LoadLeveler overview . . . 3
  4. LoadLeveler daemons . . . 8
  5. startd determines whether its own state permits a new job to run . . . 12
  6. Job state descriptions and abbreviations . . . 20
  7. Location and description of product directories following installation . . . 33
  8. Location and description of directories for submit-only LoadLeveler . . . 33
  9. Roadmap of tasks for TWS LoadLeveler administrators . . . 41
 10. Roadmap of administrator tasks related to using or modifying the LoadLeveler configuration file . . . 42
 11. Roadmap for defining LoadLeveler cluster characteristics . . . 44
 12. Default locations for all of the files and directories . . . 47
 13. Log control statements . . . 49
 14. Roadmap of configuration tasks for securing LoadLeveler operations . . . 57
 15. Roadmap of tasks for gathering job accounting data . . . 62
 16. Collecting account data - modifying the configuration file . . . 67
 17. Roadmap of administrator tasks accomplished through installation exits . . . 72
 18. Roadmap of tasks for modifying the LoadLeveler administration file . . . 83
 19. Types of limit keywords . . . 90
 20. Enforcing job step limits . . . 91
 21. Setting limits . . . 92
 22. Roadmap of additional administrator tasks . . . 103
 23. Roadmap of BACKFILL scheduler tasks . . . 111
 24. Roadmap of tasks for using an external scheduler . . . 116
 25. Effect of LoadLeveler keywords under an external scheduler . . . 116
 26. Roadmap of tasks for using preemption . . . 127
 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs . . . 129
 28. Preemption methods for which administrator or user intervention is required . . . 130
 29. Roadmap of reservation tasks for administrators . . . 132
 30. Roadmap of tasks for checkpointing jobs . . . 139
 31. Deciding where to define the directory for staging executables . . . 141
 32. Multicluster support subtasks and associated instructions . . . 149
 33. Multicluster support related topics . . . 149
 34. Subtasks for configuring a LoadLeveler multicluster . . . 150
| 35. Keywords for configuring scale-across scheduling . . . 154
 36. IBM System Blue Gene Solution documentation . . . 156
 37. Blue Gene subtasks and associated instructions . . . 157
 38. Blue Gene related topics and associated information . . . 157
 39. Blue Gene configuring subtasks and associated instructions . . . 157
 40. Learning about building and submitting jobs . . . 179
 41. Roadmap of user tasks for building and submitting jobs . . . 179
 42. Standard files for the five job steps . . . 182
 43. Checkpoint configurations . . . 191
| 44. Valid combinations of task assignment keywords are listed in each column . . . 196
 45. node and total_tasks . . . 196
 46. Blocking . . . 197
 47. Unlimited blocking . . . 198
 48. Roadmap of tasks for reservation owners and users . . . 213
 49. Reservation states, abbreviations, and usage notes . . . 214
 50. Instructions for submitting a job to run under a reservation . . . 219
 51. Submitting and monitoring jobs in a LoadLeveler multicluster . . . 224
 52. Roadmap of user tasks for managing submitted jobs . . . 229
 53. How LoadLeveler handles job priorities . . . 231
 54. User tasks available through the GUI . . . 237
 55. GUI fields and input . . . 239
 56. Nodes dialog box . . . 243
 57. Network dialog box fields . . . 244
 58. Build a job dialog box fields . . . 245
 59. Limits dialog box fields . . . 247
 60. Checkpointing dialog box fields . . . 248
 61. Blue Gene job fields . . . 248
 62. Modifying the job command file with the Edit pull-down menu . . . 249
 63. Modifying the job command file with the Tools pull-down menu . . . 250
 64. Saving and submitting information . . . 250
 65. Sorting the jobs window . . . 252
 66. Sorting the machines window . . . 257
 67. Specifying which jobs appear in the Jobs window . . . 258
 68. Specifying which machines appear in Machines window . . . 259
 69. Configuration subtasks . . . 263
 70. BG_MIN_PARTITION_SIZE values . . . 268
 71. Administration file subtasks . . . 321
 72. Notes on 64-bit support for administration file keywords . . . 325
 73. Summary of possible values set for the env_copy keyword in the administration file . . . 335
 74. Sample user and group settings for the max_reservations keyword . . . 345
 75. Job command file subtasks . . . 357
 76. Notes on 64-bit support for job command file keywords . . . 358
 77. mcm_affinity_options default values . . . 381
 78. Example of a selection table . . . 406
 79. Decision table . . . 407
 80. Decision table actions . . . 407
 81. Window identifiers in the Xloadl file . . . 408
 82. Resource variables for all the windows and the buttons . . . 408
 83. Modifying help panels . . . 410
 84. LoadLeveler command summary . . . 411
 85. llmodify options and keywords . . . 468
 86. LoadLeveler API summary . . . 541
 87. BLUE_GENE specifications for ll_get_data subroutine . . . 571
 88. CLASSES specifications for ll_get_data subroutine . . . 576
 89. CLUSTERS specifications for ll_get_data subroutine . . . 580
 90. FAIRSHARE specifications for ll_get_data subroutine . . . 582
 91. JOBS specifications for ll_get_data subroutine . . . 583
 92. MACHINES specifications for ll_get_data subroutine . . . 614
 93. MCLUSTERS specifications for ll_get_data subroutine . . . 619
 94. RESERVATIONS specifications for ll_get_data subroutine . . . 620
 95. WLMSTAT specifications for ll_get_data subroutine . . . 622
 96. query_daemon summary . . . 624
 97. query_flags summary . . . 630
 98. object_filter value related to the query flags value . . . 631
 99. enum LL_reservation_data type . . . 649
100. How nodes should be arranged in the node list . . . 694
101. Why your job might not be running . . . 700
102. Why your job might not be running . . . 703
103. Troubleshooting running jobs when a machine goes down . . . 706
104. LoadLeveler default port usage . . . 741
About this information
                  IBM® Tivoli® Workload Scheduler (TWS) LoadLeveler® provides various ways of
                  scheduling and managing applications for best performance and most efficient use
                  of resources. LoadLeveler manages both serial and parallel jobs over a cluster of
                  machines or servers, which may be desktop workstations, dedicated servers, or
                  parallel machines. This information describes how to configure and administer this
                  LoadLeveler cluster environment, and how to submit and manage jobs that run on
                  machines in the cluster.

    Who should use this information
                  This information is intended for two separate audiences:
                  v Personnel who are responsible for installing, configuring and managing the
                    LoadLeveler cluster environment. These people are called LoadLeveler
                    administrators. LoadLeveler administrative tasks include:
                    – Setting up configuration and administration files (a minimal sketch of
                      both files follows this list)
                    – Maintaining the LoadLeveler product
                    – Setting up the distributed environment for allocating batch jobs
                  v Users who submit and manage serial and parallel jobs to run in the LoadLeveler
                    cluster.
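
                  As a minimal illustration of the first administrative task above, the
                  following sketch suggests the general shape of the two files an
                  administrator maintains. The host names and values shown are hypothetical;
                  the actual keywords and stanza syntax are documented in Chapter 12,
                  “Configuration file reference,” and Chapter 13, “Administration file
                  reference.”

                     # LoadL_config (configuration file): hypothetical values
                     LOADL_ADMIN    = loadl         # login names of LoadLeveler administrators
                     SCHEDULER_TYPE = LL_DEFAULT    # which scheduler the cluster uses

                     # LoadL_admin (administration file): one stanza per resource
                     node01: type = machine
                             central_manager = true # this machine runs the negotiator daemon
                     node02: type = machine
                             schedd_host = true     # this machine accepts job submissions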

                  Both LoadLeveler administrators and general users should be experienced with
                  UNIX® commands. Administrators also should be familiar with:
                  v Cluster system management techniques such as SMIT, as it is used in the AIX®
                    environment
                  v Networking and NFS or AFS® protocols

    Conventions and terminology used in this information
                  Throughout the TWS LoadLeveler product information:
                  v TWS LoadLeveler for Linux® Multiplatform includes:
|                   – IBM System servers with Advanced Micro Devices (AMD) Opteron or Intel®
|                     Extended Memory 64 Technology (EM64T) processors
                    – IBM System x™ servers
                    – IBM BladeCenter® Intel processor-based servers
                    – IBM Cluster 1350™

                    Note: IBM Tivoli Workload Scheduler LoadLeveler is supported when running
                          Linux on non-IBM Intel-based and AMD hardware servers.

                          Supported hardware includes:
|                         – Servers with Intel 32-bit and Intel EM64T
|                         – Servers with AMD 64-bit technology
                  v Note that in this information:
                    – LoadLeveler is also referred to as Tivoli Workload Scheduler LoadLeveler and
                      TWS LoadLeveler.
                    – Switch_Network_Interface_For_HPS is also referred to as HPS or High
                      Performance Switch.



                                                                                                xiii
Table 1 describes the typographic conventions used in this information.
                        Table 1. Summary of typographic conventions
                        Typographic      Usage
                        Bold             v Bold words or characters represent system elements that you must use
                                           literally, such as commands, flags, and path names.
                                         v Bold words also indicate the first use of a term included in the glossary.
                        Italic           v Italic words or characters represent variable values that you must supply.
                                         v Italics are also used for book titles and for general emphasis in text.
                        Constant         Examples and information that the system displays appear in constant
                        width            width typeface.
                        []               Brackets enclose optional items in format and syntax descriptions.
                        {}               Braces enclose a list from which you must choose an item in format and
                                         syntax descriptions.
                        |                A vertical bar separates items in a list of choices. (In other words, it means
                                         “or.”)
                        <>               Angle brackets (less-than and greater-than) enclose the name of a key on
                                         the keyboard. For example, <Enter> refers to the key on your terminal or
                                         workstation that is labeled with the word Enter.
                        ...              An ellipsis indicates that you can repeat the preceding item one or more
                                         times.
                        <Ctrl-x>         The notation <Ctrl-x> indicates a control character sequence. For example,
                                         <Ctrl-c> means that you hold down the control key while pressing <c>.
                         \             The continuation character is used in coding examples in this information
                                       for formatting purposes.



Prerequisite and related information
                        The Tivoli Workload Scheduler LoadLeveler publications are:
                        v Installation Guide, GI10-0763
                        v Using and Administering, SA22-7881
                        v Diagnosis and Messages Guide, GA22-7882

                        To access all TWS LoadLeveler documentation, refer to the IBM Cluster
                        Information Center, which contains the most recent TWS LoadLeveler
                        documentation in PDF and HTML formats. This Web site is located at:
                        http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp

                        A TWS LoadLeveler Documentation Updates file also is maintained on this Web
                        site. The TWS LoadLeveler Documentation Updates file contains updates to the
                        TWS LoadLeveler documentation. These updates include documentation
                        corrections and clarifications that were discovered after the TWS LoadLeveler
                        books were published.

                        Both the current TWS LoadLeveler books and earlier versions of the library are
                        also available in PDF format from the IBM Publications Center Web site located at:
                        http://www.elink.ibmlink.ibm.com/publications/servlet/pbi.wss

                        To easily locate a book in the IBM Publications Center, supply the book’s
                        publication number. The publication number for each of the TWS LoadLeveler
                        books is listed after the book title in the preceding list.
How to send your comments
             Your feedback is important in helping us to produce accurate, high-quality
             information. If you have any comments about this book or any other TWS
             LoadLeveler documentation:
             v Send your comments by e-mail to: mhvrcfs@us.ibm.com
                Include the book title and order number, and, if applicable, the specific location
                of the information you have comments on (for example, a page number or a
                table number).
             v Fill out one of the forms at the back of this book and return it by mail, by fax, or
                by giving it to an IBM representative.

             To contact the IBM cluster development organization, send your comments by
             e-mail to: cluster@us.ibm.com




Summary of changes
           The following sections summarize changes to the IBM Tivoli Workload Scheduler
           (TWS) LoadLeveler product and TWS LoadLeveler library for each new release or
           major service update for a given product version. Within each information unit in
           the library, a vertical line to the left of text and illustrations indicates technical
           changes or additions made to the previous edition of the information.

           Changes to TWS LoadLeveler for this release or update include:
           v New information:
             – Recurring reservation support:
               - The TWS LoadLeveler commands and APIs have been enhanced to support
                 recurring reservation.
               - Accounting records have been enhanced to have recurring reservation
                 entries.
               - The new recurring job command file keyword will allow a user to specify
                 that the job can run in every occurrence of the recurring reservation to
                 which it is bound.
             – Data staging support:
                - Jobs can request that data files be staged in from a remote storage
                  location before the job executes and staged back to remote storage
                  after the job finishes execution.
                - LoadLeveler schedules data staging either at submit time or just in
                  time for the application execution.
             – Multicluster scale-across scheduling support:
               - Allows a large job to span resources across more than one cluster
                   v Scale-across scheduling is a way to schedule jobs in the multicluster
                     environment so that they span resources across more than one cluster.
                     This feature allows a large job that requests more resources than any
                     single cluster can provide to combine resources from several clusters
                     and run on the combined resources.
                  v Allows utilization of fragmented resources from more than one cluster
                    – Fragmented resources occur when the resources available on a single
                       cluster cannot satisfy any single job on that cluster. This feature allows
                       any size job to take advantage of these resources by combining them
                       from multiple clusters.
             – Enhanced WLM support:
               - Integrates TWS LoadLeveler with AIX Workload Manager (WLM) virtual
                 memory and the large page resource limit support.
               - Enforces virtual memory and the large page limit usage of a job.
               - Reports statistics for virtual memory and the large page limit usage.
               - Dynamically changes virtual memory and the large page limit usage of a
                 job.
             – Enhanced adapter striping (sn_all) support:
               - Submits jobs to nodes that have one or more networks in the failed
                 (NOTREADY) state provided that all of the nodes assigned to the job have
                 more than half of the networks in the READY state.


- A new striping_with_minimum_networks configuration keyword has been
                               added to the class stanza to support striping with failed networks.
                          – Enhanced affinity support:
                            - Task affinity support has been enhanced on nodes that are booted in single
                               threaded (ST) mode and on nodes that do not support simultaneous
                               multithreading (SMT).
                          – NetworkID64 for Mellanox adapters on Linux systems with InfiniBand
                            support:
                            - Generates unique NetworkID64 IDs for adapter ports that are connected to
                               the same switch and have the same IP subnet address. This ensures that
                               ports that are connected to the same switch, but are configured with
                                different IP subnet addresses, will get different NetworkID64 values.
                        v Changed information:
                          – This is the last release that will provide the following functions:
                            - The Motif-based graphical user interface xloadl. The function available in
                               xloadl has been frozen since TWS LoadLeveler 3.3.2 and there are no plans
                               to update this GUI with any new function added to TWS LoadLeveler after
                               that level.
                            - The IBM BladeCenter JS21 with a BladeCenter H chassis interconnected
                               with the InfiniBand Host Channel Adapters connected to a Cisco
                               InfiniBand SDR switch.
                            - The IBM Power System 575 (Model 9118-575) and IBM Power System 550
                               (Model 9133-55A) interconnected with the InfiniBand Host Channel
                               Adapter and Cisco switch.
                            - The High Performance Switch.
                          – If you have a mixed TWS LoadLeveler cluster and need to run your job on a
                            specific operating system or architecture, you must define the requirements
                            keyword statement in your job command file specifying the desired Arch or
                            OpSys. For example:
                              Requirements: (Arch == "RS6000") && (OpSys == "AIX53")
                        v Deleted information:
                          The following function is no longer supported and the information has been
                          removed:
                          – The scheduling of parallel jobs with the default scheduler
                            (SCHEDULER_TYPE=LL_DEFAULT)
                          – The min_processors and max_processors keywords
                          – The RSET_CONSUMABLE_CPUS option for the rset_support configuration
                            keyword and the rset job command file keyword
                          – The API functions:
                            - ll_get_nodes
                            - ll_free_nodes
                            - ll_get_jobs
                            - ll_free_jobs
                            - ll_start_job
                          – Red Hat Enterprise Linux 3
                          – The llctl purgeschedd function has been replaced by the llmovespool
                            function.
                          – The lldbconvert function is no longer needed for migration and the
                            lldbconvert command is not included in TWS LoadLeveler 3.5.



Part 1. Overview of TWS LoadLeveler concepts and operation
            Setting up IBM Tivoli Workload Scheduler (TWS) LoadLeveler involves defining
            machines, users, jobs, and how they interact, in such a way that TWS LoadLeveler
            is able to run jobs quickly and efficiently.

            Once you have a basic understanding of the TWS LoadLeveler product and its
            interfaces, you can find more details in the topics listed in Table 2.
            Table 2. Major topics in TWS LoadLeveler: Using and Administering
            To learn about:                          Read the following:
            Performing administrator tasks           Part 2, “Configuring and managing the TWS
                                                     LoadLeveler environment,” on page 39
            Performing general user tasks            Part 3, “Submitting and managing TWS
                                                     LoadLeveler jobs,” on page 177
            Using TWS LoadLeveler interfaces         Part 4, “TWS LoadLeveler interfaces reference,” on
                                                     page 261




Chapter 1. What is LoadLeveler?
            LoadLeveler is a job management system that allows users to run more jobs in less
            time by matching the jobs’ processing needs with the available resources.
            LoadLeveler schedules jobs, and provides functions for building, submitting, and
            processing jobs quickly and efficiently in a dynamic environment.

            Figure 1 shows the different environments to which LoadLeveler can schedule jobs.
            Together, these environments comprise the LoadLeveler cluster.


             [Figure 1 is a diagram of a LoadLeveler cluster that contains
             submit-only workstations, IBM Power Systems servers running AIX,
             an IBM eServer Cluster 1350 running Linux, and IBM BladeCenter
             servers running Linux.]

             Figure 1. Example of a LoadLeveler cluster

            As Figure 1 also illustrates, a LoadLeveler cluster can include submit-only machines,
            which allow users to have access to a limited number of LoadLeveler features.

            Throughout all the topics, the terms workstation, machine, node, and operating system
            instance (OSI) refer to the machines in your cluster. In LoadLeveler, an OSI is
            treated as a single instance of an operating system image.

            If you are unfamiliar with the TWS LoadLeveler product, consider reading one or
            more of the introductory topics listed in Table 3:
            Table 3. Topics in the TWS LoadLeveler overview
            To learn about:                             Read the following:
            Using the default configuration for         Chapter 2, “Getting a quick start using the default
            getting a quick start                       configuration,” on page 29
            Specific products and features that are     Chapter 3, “What operating systems are supported
            required for or available through the       by LoadLeveler?,” on page 35
            TWS LoadLeveler environment




LoadLeveler basics
                         LoadLeveler has various types of interfaces that enable users to create and submit
                         jobs and allow system administrators to configure the system and control running
                         jobs.

                         These interfaces include:
                         v Control files that define the elements, characteristics, and policies of LoadLeveler
                           and the jobs it manages. These files are the configuration file, the administration
                           file, and job command file.
                         v The command line interface, which gives you access to basic job and
                           administrative functions.
                         v A graphical user interface (GUI), which provides system access similar to the
                           command line interface. Experienced users and administrators may find the
                           command line interface more efficient than the GUI for job and administrative
                           functions.
                         v An application programming interface (API), which allows application programs
                           written by users and administrators to interact with the LoadLeveler
                           environment.

                         The commands, GUI, and APIs permit different levels of access to administrators
                         and users. User access is typically restricted to submitting and managing
                         individual jobs, while administrative access allows setting up system
                         configurations, job scheduling, and accounting.

                         Using either the command line or the GUI, users create job command files that
                         instruct the system on how to process information. Each job command file consists
                         of keywords followed by the user defined association for that keyword. For
                         example, the keyword executable tells LoadLeveler that you are about to define
                         the name of a program you want to run. Therefore, executable = longjob tells
                         LoadLeveler to run the program called longjob.
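
                          For example, a complete job command file built around this keyword
                          might look like the following sketch (the output and error file
                          names shown are illustrative):

                             # @ executable = longjob
                             # @ output     = longjob.out
                             # @ error      = longjob.err
                             # @ queue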

                         After creating the job command file, you invoke LoadLeveler commands to
                         monitor and control the job as it moves through the system. LoadLeveler monitors
                         each job as it moves through the system using process control daemons. However,
                         the administrator maintains ultimate control over all LoadLeveler jobs by defining
                         job classes that control how and when LoadLeveler will run a job.

                         In addition to setting up job classes, the administrator can also control how jobs
                         move through the system by specifying the type of scheduler. LoadLeveler has
                         several different scheduler options that start jobs using specific algorithms to
                         balance job priority with available machine resources.

                         When LoadLeveler administrators are configuring clusters and when users are
                         planning jobs, they need to be aware of the machine resources available in the
                         cluster. These resources include items like the number of CPUs and the amount of
                         memory available for each job. Because resource availability will vary over time,
                         LoadLeveler defines them as consumable resources.

LoadLeveler: A network job management and scheduling system
                         A network job management and job scheduling system, such as LoadLeveler, is a
                         software program that schedules and manages jobs that you submit to one or more
                         machines under its control.


LoadLeveler accepts jobs that users submit and reviews the job requirements.
      LoadLeveler then examines the machines under its control to determine which
      machines are best suited to run each job.

Job definition
      LoadLeveler schedules your jobs on one or more machines for processing. The
      definition of a job, in this context, is a set of job steps.

       For each job step, you can specify a different executable (the executable is the part
      of the job that gets processed). You can use LoadLeveler to submit jobs which are
      made up of one or more job steps, where each job step depends upon the
      completion status of a previous job step. For example, Figure 2 illustrates a stream
      of job steps:


       [Figure 2 is a flowchart of a job command file with three job steps:
       job step 1 copies data from tape and checks the exit status; if the
       exit status is x, job step 2 processes the data and checks the exit
       status, otherwise the program ends; if that exit status is x, job
       step 3 formats and prints the results, otherwise the program ends.]

       Figure 2. LoadLeveler job steps

      Each of these job steps is defined in a single job command file. A job command
      file specifies the name of the job, as well as the job steps that you want to submit,
      and can contain other LoadLeveler statements.

      LoadLeveler tries to execute each of your job steps on a machine that has enough
      resources to support executing and checkpointing each step. If your job command
      file has multiple job steps, the job steps will not necessarily run on the same
      machine, unless you explicitly request that they do.
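
       As a sketch, the job steps in Figure 2 might be expressed with the
       step_name and dependency keywords, where each dependency expression
       makes a step run only if the previous step exited with status 0 (the
       step and executable names are illustrative):

          # @ step_name  = copy_data
          # @ executable = /u/user/bin/copy_data
          # @ queue
          # @ step_name  = process_data
          # @ dependency = (copy_data == 0)
          # @ executable = /u/user/bin/process_data
          # @ queue
          # @ step_name  = print_results
          # @ dependency = (process_data == 0)
          # @ executable = /u/user/bin/print_results
          # @ queue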

      You can submit batch jobs to LoadLeveler for scheduling. Batch jobs run in the
       background and generally do not require any input from the user. Batch jobs can
       be either serial or parallel. A serial job runs on a single machine. A parallel job is a
      program designed to execute as a number of individual, but related, processes on
      one or more of your system’s nodes. When executed, these related processes can
      communicate with each other (through message passing or shared memory) to
      exchange data or synchronize their execution.

      For parallel jobs, LoadLeveler interacts with Parallel Operating Environment (POE)
      to allocate nodes, assign tasks to nodes, and launch tasks.



Machine definition
                         For LoadLeveler to schedule a job on a machine, the machine must be a valid
                         member of the LoadLeveler cluster.

                         A cluster is the combination of all of the different types of machines that use
                         LoadLeveler.

                         To make a machine a member of the LoadLeveler cluster, the administrator has to
                         install the LoadLeveler software onto the machine and identify the central manager
                         (described in “Roles of machines”). Once a machine becomes a valid member of
                         the cluster, LoadLeveler can schedule jobs to it.

                         Roles of machines
                         Each machine in the LoadLeveler cluster performs one or more roles in scheduling
                         jobs.

                         Roles performed in scheduling jobs by each machine in the LoadLeveler cluster are
                         as follows:
                         v Scheduling Machine: When a job is submitted, it gets placed in a queue
                           managed by a scheduling machine. This machine contacts another machine that
                           serves as the central manager for the entire LoadLeveler cluster. This scheduling
                           machine asks the central manager to find a machine that can run the job, and
                           also keeps persistent information about the job. Some scheduling machines are
                           known as public scheduling machines, meaning that any LoadLeveler user can
                           access them. These machines schedule jobs submitted from submit-only
                            machines.
                         v Central Manager Machine: The role of the central manager is to examine the
                           job’s requirements and find one or more machines in the LoadLeveler cluster
                           that will run the job. Once it finds the machine(s), it notifies the scheduling
                           machine.
                         v Executing Machine: The machine that runs the job is known as the executing
                           machine.
                         v Submitting Machine: This type of machine is known as a submit-only machine.
                           It participates in the LoadLeveler cluster on a limited basis. Although the name
                           implies that users of these machines can only submit jobs, they can also query
                           and cancel jobs. Users of these machines also have their own Graphical User
                           Interface (GUI) that provides them with the submit-only subset of functions. The
                           submit-only machine feature allows workstations that are not part of the
                           LoadLeveler cluster to submit jobs to the cluster.
                         Keep in mind that one machine can assume multiple roles, as shown in Figure 3 on
                         page 7.




              [Figure 3 is a diagram of a LoadLeveler cluster in which
              submit-only machines send jobs to machines that each combine the
              scheduling and executing roles; one of these machines also acts
              as the central manager.]

              Figure 3. Multiple roles of machines



             Machine availability
              There may be times when some of the machines in the LoadLeveler cluster are
              not available to process jobs; for instance, when the owners of the machines
              have decided to make them unavailable. This ability of LoadLeveler to allow
              users to restrict the use of their machines provides flexibility and control
              over the resources.

             Machine owners can make their personal workstations available to other
             LoadLeveler users in several ways. For example, you can specify that:
             v The machine will always be available
             v The machine will be available only between certain hours
             v The machine will be available when the keyboard and mouse are not being used
               interactively.
             Owners can also specify that their personal workstations never be made available
             to other LoadLeveler users.

How LoadLeveler schedules jobs
             When a user submits a job, LoadLeveler examines the job command file to
             determine what resources the job will need. LoadLeveler determines which
             machine, or group of machines, is best suited to provide these resources, then
             LoadLeveler dispatches the job to the appropriate machines. To aid this process,
             LoadLeveler uses queues.

             A job queue is a list of jobs that are waiting to be processed. When a user submits
             a job to LoadLeveler, the job is entered into an internal database, which resides on
             one of the machines in the LoadLeveler cluster, until it is ready to be dispatched to
             run on another machine.




Once LoadLeveler examines a job to determine its required resources, the job is
                         dispatched to a machine to be processed. A job can be dispatched to either one
                         machine, or in the case of parallel jobs, to multiple machines. Once the job reaches
                         the executing machine, the job runs.

                         Jobs do not necessarily get dispatched to machines in the cluster on a first-come,
                          first-served basis. Instead, LoadLeveler examines the requirements and
                         characteristics of the job and the availability of machines, and then determines the
                         best time for the job to be dispatched.

                         LoadLeveler also uses job classes to schedule jobs to run on machines. A job class
                         is a classification to which a job can belong. For example, short running jobs may
                         belong to a job class called short_jobs. Similarly, jobs that are only allowed to run
                         on the weekends may belong to a class called weekend. The system administrator
                         can define these job classes and select the users that are authorized to submit jobs
                         of these classes.
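
                          For example, an administrator might define such classes in the
                          administration file with stanzas like the following sketch (the
                          class names and limits are illustrative):

                             short_jobs: type = class
                                         wall_clock_limit = 00:10:00

                             weekend:    type = class
                                         wall_clock_limit = 48:00:00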

                         You can specify which types of jobs will run on a machine by specifying the types
                         of job classes the machine will support. LoadLeveler also examines a job’s priority
                          to determine when to schedule the job on a machine. The priority of a job
                          determines its position in the list of all jobs waiting to be dispatched.

                         “The LoadLeveler job cycle” on page 16 describes job flow in the LoadLeveler
                         environment in more detail.

How LoadLeveler daemons process jobs
                         LoadLeveler has its own set of daemons that control the processes moving jobs
                         through the LoadLeveler cluster.

                         The LoadLeveler daemons are programs that run continuously and control the
                         processes that move jobs through the LoadLeveler cluster. A master daemon
                         (LoadL_master) runs on all machines in the LoadLeveler cluster and manages
                         other daemons.

                         Table 4 summarizes these daemons, which are described in further detail in topics
                         immediately following the table.
                         Table 4. LoadLeveler daemons
                         Daemon                         Description
                         LoadL_master                   Referred to as the master daemon. Runs on all machines in
                                                        the LoadLeveler cluster and manages other daemons.
                         LoadL_schedd                   Referred to as the Schedd daemon. Receives jobs from the
                                                        llsubmit command and manages them on machines
                                                        selected by the negotiator daemon (as defined by the
                                                        administrator).
                         LoadL_startd                   Referred to as the startd daemon. Monitors job and
                                                        machine resources on local machines and forwards
                                                        information to the negotiator daemon.

                                                        The startd daemon spawns the starter process
                                                        (LoadL_starter) which manages running jobs on the
                                                        executing machine.




Table 4. LoadLeveler daemons (continued)
     Daemon                        Description
     LoadL_negotiator              Referred to as the negotiator daemon. Monitors the status
                                   of each job and machine in the cluster. Responds to queries
                                   from llstatus and llq commands. Runs on the central
                                   manager machine.
     LoadL_kbdd                    Referred to as the keyboard daemon. Monitors keyboard
                                   and mouse activity.
     LoadL_GSmonitor               Referred to as the gsmonitor daemon. Monitors for down
                                   machines based on the heartbeat responses of the
                                   MACHINE_UPDATE_INTERVAL time period.



The master daemon
     The master daemon runs on every machine in the LoadLeveler cluster, except the
     submit-only machines. The real and effective user ID of this daemon must be root.

     The LoadL_master binary is installed as a setuid program with the owner set to
     root. The master daemon and all daemons started by the master must be able to
     run with root privileges in order to switch the identity to the owner of any job
     being processed.

     The master daemon determines whether to start any other daemons by checking
     the START_DAEMONS keyword in the global or local configuration file. If the
     keyword is set to true, the daemons are started. If the keyword is set to false, the
     master daemon terminates and generates a message.
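
      For example, the global configuration file might contain the following
      line:

         START_DAEMONS = TRUE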

     The master daemon will not start on a Linux machine if SEC_ENABLEMENT is
     set to CTSEC. If the master daemon does not start, no other daemons will start.

     On the machine designated as the central manager, the master runs the negotiator
     daemon. The master also controls the central manager backup function. The
     negotiator runs on either the primary or an alternate central manager. If a central
     manager failure is detected, one of the alternate central managers becomes the
     primary central manager by starting the negotiator.

      The master daemon starts and, if necessary, restarts all of the LoadLeveler daemons
      that the machine it resides on is configured to run. As part of its startup procedure,
     this daemon executes the .llrc file (a dummy file is provided in the bin
     subdirectory of the release directory). You can use this script to customize your
     local configuration file, specifying what particular data is stored locally. This
     daemon also runs the kbdd daemon, which monitors keyboard and mouse activity.

     When the master daemon detects a failure on one of the daemons that it is
     monitoring, it attempts to restart it. Because this daemon recognizes that certain
     situations may prevent a daemon from running, it limits its restart attempts to the
     number defined for the RESTARTS_PER_HOUR keyword in the configuration file.
     If this limit is exceeded, the master daemon forces all daemons including itself to
     exit.

     When a daemon must be restarted, the master sends mail to the administrators
     identified by the LOADL_ADMIN keyword in the configuration file. The mail
     contains the name of the failing daemon, its termination status, and a section of the
     daemon’s most recent log file. If the master aborts after exceeding
     RESTARTS_PER_HOUR, it will also send that mail before exiting.
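
      For example, these keywords might be set in the configuration file as
      follows (the administrator names and the restart limit shown are
      illustrative):

         LOADL_ADMIN = loadl admin1
         RESTARTS_PER_HOUR = 12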

The master daemon may perform the following actions in response to an llctl
                        command:
                        v Kill all daemons and exit (stop keyword)
                        v Kill all daemons and execute a new master (recycle keyword)
                        v Rerun the .llrc file, reread the configuration files, stop or start daemons as
                          appropriate for the new configuration files (reconfig keyword)
                         v Send drain request to startd and send result to caller (drain keyword)
                        v Send flush request to startd and send result to caller (flush keyword)
                        v Send suspend request to startd and send result to caller (suspend keyword)
                        v Send resume request to startd and Schedd, and send result to caller (resume
                          keyword)
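
                         For example, an administrator might issue llctl commands such as the
                         following, where hostname is a placeholder for a machine in the cluster;
                         the -h flag directs the command to a single machine, and -g directs it
                         to all machines in the cluster:

                            llctl -h hostname drain
                            llctl -g reconfig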

             The Schedd daemon
                        The Schedd daemon receives jobs sent by the llsubmit command and manages
                         those jobs on machines selected by the negotiator daemon. The Schedd daemon is
                        started, restarted, signalled, and stopped by the master daemon.
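
                         For example, a user submits a job command file to the Schedd daemon with
                         the llsubmit command (the file name is illustrative):

                            llsubmit longjob.cmd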

                        The Schedd daemon can be in any one of the following activity states:
                        Available
                               This machine is available to schedule jobs.
                        Drained
                               The Schedd machine accepts no more jobs. There are no jobs in starting or
                               running state. Jobs in the Idle state are drained, meaning they will not get
                               dispatched.
                        Draining
                               The Schedd daemon is being drained by the administrator but some jobs
                               are still running. The state of the machine remains Draining until all
                               running jobs complete. At that time, the machine status changes to
                               Drained.
                        Down The daemon is not running on this machine. The Schedd daemon enters
                             this state when it has not reported its status to the negotiator. This can
                             occur when the machine is actually down, or because there is a network
                             failure.

                        The Schedd daemon performs the following functions:
                        v Assigns new job identifiers when requested by the job submission process (for
                          example, by the llsubmit command).
                        v Receives new jobs from the llsubmit command. A new job is received as a job
                          object for each job step. A job object is the data structure in memory containing
                          all the information about a job step. The Schedd forwards the job object to the
                          negotiator daemon as soon as it is received from the submit command.
                        v Maintains on disk copies of jobs submitted locally (on this machine) that are
                          either waiting or running on a remote (different) machine. The central manager
                          can use this information to reconstruct the job information in the event of a
                          failure. This information is also used for accounting purposes.
                        v Responds to directives sent by the administrator through the negotiator daemon.
                          The directives include:
                          – Run a job.
                          – Change the priority of a job.
                          – Remove a job.
                          – Hold or release a job.
                          – Send information about all jobs.


v Sends job events to the negotiator daemon when:
        – Schedd is restarting.
         – A new series of job objects is arriving.
        – A job is started.
        – A job was rejected, completed, removed, or vacated. Schedd determines the
           status by examining the exit status returned by the startd.
      v Communicates with the Parallel Operating Environment (POE) when you run an
        interactive POE job.
      v Requests that a remote startd daemon end a job.
      v Receives accounting information from startd.
      v Receives requests for reservations.
      v Collects resource usage data when jobs terminate and stores it as historic fair
        share data in the $(SPOOL) directory.
      v Sends historic fair share data to the central manager when it is updated or when
        the Schedd daemon is restarted.
      v Maintains and stores records of historic CPU and IBM System Blue Gene®
        Solution utilization for users and groups known to the Schedd.
      v Passes the historic CPU and Blue Gene utilization data to the central manager.

The startd daemon
      The startd daemon monitors the status of each job, reservation, and machine in the
      cluster, and forwards this information to the negotiator daemon.

      The startd also receives and executes job requests originating from remote
      machines. The master daemon starts, restarts, signals, and stops the startd daemon.

      Checkpoint/restart is not supported in LoadLeveler for Linux. If a checkpointed
      job is sent to a Linux node, the Linux node will reject the job.

      The startd daemon can be in any one of the following states:
      Busy    The maximum number of jobs are running on this machine as specified by
              the MAX_STARTERS configuration keyword.
      Down The daemon is not running on this machine. The startd daemon enters this
           state when it has not reported its status to the negotiator. This can occur
           when the machine is actually down, or because there is a network failure.
      Drained
             The startd machine will not accept any new jobs. No jobs are running
             when startd is in the drained state.
      Draining
             The startd daemon is being drained by the administrator, but some jobs are
             still running. The machine remains in the draining state until all of the
             running jobs have completed, at which time the machine status changes to
             drained. The startd daemon will not accept any new jobs while in the
             draining state.
      Flush   Any running jobs have been vacated (terminated and returned to the
              queue to be redispatched). The startd daemon will not accept any new
              jobs.
      Idle    The machine is not running any jobs.
      None    LoadLeveler is running on this machine, but no jobs can run here.


                                                          Chapter 1. What is LoadLeveler?   11
Running
                              The machine is running one or more jobs and is capable of running more.
                        Suspend
                              All LoadLeveler jobs running on this machine are stopped (cease
                              processing), but remain in virtual memory. The startd daemon will not
                              accept any new jobs.

                        The startd daemon performs these functions:
                         v Runs a time-out procedure that builds a snapshot of the state of the
                           machine, including static and dynamic data. This time-out procedure is run at
                           the following times:
                          – After a job completes.
                          – According to the definition of the POLLING_FREQUENCY keyword in the
                             configuration file.
                        v Records the following information in LoadLeveler variables and sends the
                          information to the negotiator.
                          – State (of the startd daemon)
                          – EnteredCurrentState
                          – Memory
                          – Disk
                          – KeyboardIdle
                          – Cpus
                          – LoadAvg
                          – Machine
                          – Adapter
                          – AvailableClasses
                        v Calculates the SUSPEND, RESUME, CONTINUE, and VACATE expressions
                          through which you can manage job status.
                        v Receives job requests from the Schedd daemon to:
                          – Start a job
                          – Preempt or resume a job
                          – Vacate a job
                           – Cancel a job
                          When the Schedd daemon tells the startd daemon to start a job, the startd
                          determines whether its own state permits a new job to run:
                        Table 5. startd determines whether its own state permits a new job to run
                        If:                       Then this happens:
                        Yes, it can start a new   The startd forks a starter process.
                        job
                        No, it cannot start a     The startd rejects the request for one of the following reasons:
                        new job                   v Jobs have been suspended, flushed, or drained
                                                  v The job limit set for the MAX_STARTERS keyword has been
                                                    reached
                                                  v There are not enough classes available for the designated job class

                        v Receives requests from the master (through the llctl command) to do one of the
                          following:
                          – Drain (drain keyword)
                          – Flush (flush keyword)
                          – Suspend (suspend keyword)
                          – Resume (resume keyword)


v For each request, startd marks its own new state, forwards its new state to the
        negotiator daemon, and then performs the appropriate action for any jobs that
        are active.
       v Receives notification of keyboard and mouse activity from the kbdd daemon.
      v Periodically examines the process table for LoadLeveler jobs and accumulates
        resources consumed by those jobs. This resource data is used to determine if a
        job has exceeded its job limit and for recording in the history file.
       v Sends accounting information to the Schedd daemon.
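
       Keywords mentioned in this topic, such as POLLING_FREQUENCY and
       MAX_STARTERS, are set in the configuration file; for example (the
       values shown are illustrative):

          POLLING_FREQUENCY = 5
          MAX_STARTERS = 2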

      The starter process
      The startd daemon spawns a starter process after the Schedd daemon tells the
      startd daemon to start a job.

      The starter process manages all the processes associated with a job step. The starter
      process is responsible for running the job and reporting status back to the startd
      daemon.

      The starter process performs these functions:
      v Processes the prolog and epilog programs as defined by the JOB_PROLOG and
        JOB_EPILOG keywords in the configuration file. The job will not run if the
        prolog program exits with a return code other than zero.
      v Handles authentication. This includes:
        – Authenticates AFS, if necessary
        – Verifies that the submitting user is not root
        – Verifies that the submitting user has access to the appropriate directories in
           the local file system.
      v Runs the job by forking a child process that runs with the user ID and all
        groups of the submitting user. That child process creates a new process group of
        which it is the process group leader, and executes the user’s program or a shell.
        The starter process is responsible for detecting the termination of any process
        that it forks. To ensure that all processes associated with a job are terminated
        after the process forked by the starter terminates, process tracking must be
        enabled. To configure LoadLeveler for process tracking, see “Tracking job
        processes” on page 70.
      v Responds to vacate and suspend orders from the startd.
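
       For example, prolog and epilog programs might be configured as follows
       (the paths are illustrative):

          JOB_PROLOG = /u/loadl/scripts/prolog
          JOB_EPILOG = /u/loadl/scripts/epilog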

The negotiator daemon
      The negotiator daemon maintains status of each job and machine in the cluster
      and responds to queries from the llstatus and llq commands.
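
       For example, the following commands query the negotiator for the
       status of the machines and jobs in the cluster:

          llstatus
          llq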

      The negotiator daemon runs on a single machine in the cluster (the central
      manager machine). This daemon is started, restarted, signalled, and stopped by the
      master daemon.

      In a mixed cluster, the negotiator daemon must run on an AIX node.

      The negotiator daemon receives status messages from each Schedd and startd
      daemon running in the cluster. The negotiator daemon tracks:
      v Which Schedd daemons are running
      v Which startd daemons are running, and the status of each startd machine.




If the negotiator does not receive an update from any machine within the time
                        period defined by the MACHINE_UPDATE_INTERVAL keyword, then the
                        negotiator assumes that the machine is down, and therefore the Schedd and startd
                        daemons are also down.
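
                         For example, this interval might be set in the configuration file as
                         follows (the value, in seconds, is illustrative):

                            MACHINE_UPDATE_INTERVAL = 300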

                        The negotiator also maintains in its memory several queues and tables which
                        determine where the job should run.

                        The negotiator performs the following functions:
                        v Receives and records job status changes from the Schedd daemon.
                        v Schedules jobs based on a variety of scheduling criteria and policy options. Once
                          a job is selected, the negotiator contacts the Schedd that originally created the
                          job.
                        v Handles requests to:
                          – Set priorities
                          – Query about jobs, machines, classes, and reservations
                          – Change reservation attributes
                          – Bind jobs to reservations
                          – Remove a reservation
                          – Remove a job
                          – Hold or release a job
                          – Favor or unfavor a user or a job.
                        v Receives notification of Schedd resets indicating that a Schedd has restarted.

             The kbdd daemon
                        The kbdd daemon monitors keyboard and mouse activity.

                        The kbdd daemon is spawned by the master daemon if the X_RUNS_HERE
                        keyword in the configuration file is set to true.

                        The kbdd daemon notifies the startd daemon when it detects keyboard or mouse
                        activity; however, kbdd is not interrupt driven. It sleeps for the number of seconds
                        defined by the POLLING_FREQUENCY keyword in the LoadLeveler
                        configuration file, and then determines if X events, in the form of mouse or
                        keyboard activity, have occurred. For more information on the configuration file,
                        see Chapter 5, “Defining LoadLeveler resources to administer,” on page 83.
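
                         For example, a machine on which X is running might set the following in
                         its local configuration file:

                            X_RUNS_HERE = TRUE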

             The gsmonitor daemon
                        The gsmonitor daemon is not available in LoadLeveler for Linux.

                        The negotiator daemon monitors for down machines based on the heartbeat
                        responses of the MACHINE_UPDATE_INTERVAL time period. If the negotiator
                        has not received an update after two MACHINE_UPDATE_INTERVAL periods,
                        then it marks the machine as down, and notifies the Schedd to remove any jobs
                        running on that machine. The gsmonitor daemon (LoadL_GSmonitor) allows this
                        cleanup to occur more reliably. The gsmonitor daemon uses the Group Services
                        Application Programming Interface (GSAPI) to monitor machine availability on
                        peer domains and to notify the negotiator quickly when a machine is no longer
                        reachable.

If the GSMONITOR_DOMAIN keyword was not specified in the LoadLeveler
configuration file, then LoadLeveler will try to determine if the machine is
running in a peer (cluster) domain. The gsmonitor must run in a peer domain.
When the gsmonitor detects that it is running in an active peer domain, it uses
the RMC API to determine the node numbers and names of the machines running in
the cluster.

If the administrator sets up a LoadLeveler administration file that contains OSIs
spanning several peer domains then a gsmonitor daemon must be started in each
domain. A gsmonitor daemon can monitor only the OSIs contained in the domain
within which it is running. The administrator specifies which OSIs run the
gsmonitor daemon by specifying GSMONITOR_RUNS_HERE=TRUE in the local
configuration file for that OSI. The default for GSMONITOR_RUNS_HERE is
False.

The gsmonitor daemon should be run on one or two nodes in the peer domain. By
running LoadL_GSmonitor on more than one node in a domain you will have a
backup in case one of the nodes that the monitor is running on goes down.
LoadL_GSmonitor subscribes to the Group Services system-defined host
membership group, which is represented by the HA_GS_HOST_MEMBERSHIP
Group Services keyword. This group monitors every configured node in the
system partition and every node in the active peer domain.

Note:
        1. The Group Services routines need to be run as root, so the
           LoadL_GSmonitor executable must be owned by root and have the
           setuid permission bit enabled.
         2. It will not cause a problem to run more than one LoadL_GSmonitor
            daemon per peer domain; this will just cause the negotiator to be
            notified by each running daemon.
        3. For more information about the Group Services subsystem, see the RSCT
           Administration Guide, SA22-7889 for peer domains.
        4. For more information about GSAPI, see Group Services Programming Guide
           and Reference, SA22-7355.




The LoadLeveler job cycle
                        To illustrate the flow of job information through the LoadLeveler cluster, a
                        description and sequence of diagrams have been provided.



                         [Figure 4 is a diagram of a LoadLeveler cluster in which a job (arrow 1)
                         is submitted to a scheduling machine, the scheduling machine and the
                         central manager exchange information about the job (arrows 2 and 3), and
                         the scheduling machine then contacts an executing machine (arrow 4).]

                         Figure 4. High-level job flow

                        The managing machine in a LoadLeveler cluster is known as the central manager.
                        There are also machines that act as schedulers, and machines that serve as the
                        executing machines. The arrows in Figure 4 illustrate the following:
                        v Arrow 1 indicates that a job has been submitted to LoadLeveler.
                        v Arrow 2 indicates that the scheduling machine contacts the central manager to
                          inform it that a job has been submitted, and to find out if a machine exists that
                          matches the job requirements.
                        v Arrow 3 indicates that the central manager checks to determine if a machine
                          exists that is capable of running the job. Once a machine is found, the central
                          manager informs the scheduling machine which machine is available.
                        v Arrow 4 indicates that the scheduling machine contacts the executing machine
                          and provides it with information regarding the job. In this case, the scheduling
                          and executing machines are different machines in the cluster, but they do not
                          have to be different; the scheduling and executing machines may be the same
                          physical machine.

                        Figure 4 is broken down into the following more detailed diagrams illustrating
                        how LoadLeveler processes a job. The diagrams indicate specific job states for this
                        example, but do not list all of the possible states for LoadLeveler jobs. A complete
                        list of job states appears in “LoadLeveler job states” on page 19.
                        1. Submit a LoadLeveler job:




   [Diagram: within the LoadLeveler cluster, a job (1) is submitted to the
   Schedd daemon on the scheduling machine, which stores the job information
   on local disk (2) and sends it to the negotiator daemon on the central
   manager (3). The job is Idle.]

Figure 5. Job is submitted to LoadLeveler

   Figure 5 illustrates that the Schedd daemon runs on the scheduling machine.
   This machine can also have the startd daemon running on it. The negotiator
   daemon resides on the central manager machine. The arrows in Figure 5
   illustrate the following:
   v Arrow 1 indicates that a job has been submitted to the scheduling machine.
   v Arrow 2 indicates that the Schedd daemon, on the scheduling machine,
      stores all of the relevant job information on local disk.
   v Arrow 3 indicates that the Schedd daemon sends job description information
      to the negotiator daemon. At this point, the submitted job is in the Idle state.
2. Permit to run:



   [Diagram: the negotiator daemon on the central manager contacts the Schedd
   daemon on the scheduling machine (4). The job is Pending or Starting.]

Figure 6. LoadLeveler authorizes the job

In Figure 6 on page 17, arrow 4 indicates that the negotiator daemon authorizes
                           the Schedd daemon to begin taking steps to run the job. This authorization is
                           called a permit to run. Once this is done, the job is considered Pending or
                           Starting.
                        3. Prepare to run:



                           [Diagram: the Schedd daemon on the scheduling machine contacts the
                           startd daemon on the executing machine (5), which may be remote or
                           local to the scheduling machine. The job is Pending or Starting.]

                        Figure 7. LoadLeveler prepares to run the job

                           In Figure 7, arrow 5 illustrates that the Schedd daemon contacts the startd
                           daemon on the executing machine and requests that it start the job. The
                           executing machine can either be a local machine (the machine to which the job
                           was submitted) or another machine in the cluster. In this example, the local
                           machine is not the executing machine.
                        4. Initiate job:



                           [Diagram: the startd daemon on the executing machine spawns a
                           starter process (6); the Schedd daemon sends the starter the job
                           information and the executable (7), and notifies the negotiator
                           daemon (8). The job is Running.]

                        Figure 8. LoadLeveler starts the job

The arrows in Figure 8 on page 18 illustrate the following:
         v Arrow 6 indicates that the startd daemon on the executing machine spawns a
           starter process for the job.
         v Arrow 7 indicates that the Schedd daemon sends the starter process the job
           information and the executable.
         v Arrow 8 indicates that the Schedd daemon notifies the negotiator daemon
           that the job has been started and the negotiator daemon marks the job as
           Running.
         The starter forks and executes the user’s job, and the starter parent waits for
         the child to complete.
      5. Complete job:



                         [Diagram: the starter process notifies the startd daemon (9), which
                         notifies the Schedd daemon (10), which forwards the information to
                         the negotiator daemon (11). The job is Complete Pending or
                         Completed.]

      Figure 9. LoadLeveler completes the job

         The arrows in Figure 9 illustrate the following:
         v Arrow 9 indicates that when the job completes, the starter process notifies
           the startd daemon.
         v Arrow 10 indicates that the startd daemon notifies the Schedd daemon.
         v Arrow 11 indicates that the Schedd daemon examines the information it has
           received, and forwards it to the negotiator daemon. At this point, the job is
           in Completed or Complete Pending state.

LoadLeveler job states
      As LoadLeveler processes a job, the job moves through various states.

      These states are listed in Table 6 on page 20. Job states that include “Pending,”
      such as Complete Pending and Vacate Pending, are intermediate, temporary states.

      Some options on LoadLeveler interfaces are valid only for jobs in certain states. For
      example, the llmodify command has options that apply only to jobs that are in the
      Idle state, or in states that are similar to it. To determine which job states are
                         similar to the Idle state, use the “Similar to...” column in Table 6 on page 20, which
indicates whether a particular job state is similar to the Idle, Running, or
                        Terminating state. A dash (—) indicates that the state is not similar to an Idle,
                        Running, or Terminating state.
                        Table 6. Job state descriptions and abbreviations
                        Job state          Similar to      Abbreviation in   Description
                                           Idle or         displays/output
                                           Running state?
                        Canceled           Terminating    CA                The job was canceled either by a user or
                                                                            by an administrator.
                        Checkpointing      Running        CK                Indicates that a checkpoint has been
                                                                            initiated.
                        Completed          Terminating    C                 The job has completed.
                        Complete           Terminating    CP                The job is in the process of being
                        Pending                                             completed.
                        Deferred           Idle           D                 The job will not be assigned to a machine
                                                                            until a specified date. This date may have
                                                                            been specified by the user in the job
                                                                            command file, or may have been
                                                                            generated by the negotiator because a
                                                                            parallel job did not accumulate enough
                                                                            machines to run the job. Only the
                                                                            negotiator places a job in the Deferred
                                                                            state.
                        Idle               Idle           I                 The job is being considered to run on a
                                                                            machine, though no machine has been
                                                                            selected.
                        Not Queued         Idle           NQ                The job is not being considered to run on
                                                                            a machine. A job can enter this state
                                                                            because the associated Schedd is down,
                                                                            the user or group associated with the job
                                                                            is at its maximum maxqueued or maxidle
                                                                            value, or because the job has a
                                                                            dependency which cannot be determined.
                                                                            For more information on these keywords,
                                                                            see “Controlling the mix of idle and
                                                                            running jobs” on page 721. (Only the
                                                                            negotiator places a job in the NotQueued
                                                                            state.)
                        Not Run            —              NR                The job will never be run because a
                                                                            dependency associated with the job was
                                                                            found to be false.
                        Pending            Running        P                 The job is in the process of starting on one
                                                                            or more machines. (The negotiator
                                                                            indicates this state until the Schedd
                                                                            acknowledges that it has received the
                                                                            request to start the job. Then the
                                                                            negotiator changes the state of the job to
                                                                            Starting. The Schedd indicates the
                                                                            Pending state until all startd machines
                                                                            have acknowledged receipt of the start
                                                                            request. The Schedd then changes the
                                                                             state of the job to Starting.)
Preempted         Running         E               The job is preempted. This state applies
                                                  only when LoadLeveler uses the suspend
                                                  method to preempt the job.
Preempt           Running         EP              The job is in the process of being
Pending                                           preempted. This state applies only when
                                                  LoadLeveler uses the suspend method to
                                                  preempt the job.
Rejected          Idle            X               The job is rejected.
Reject Pending    Idle            XP              The job did not start. Possible reasons
                                                  why a job is rejected are: job requirements
                                                  were not met on the target machine, or
                                                  the user ID of the person running the job
                                                  is not valid on the target machine. After a
                                                  job leaves the Reject Pending state, it is
                                                  moved into one of the following states:
                                                  Idle, User Hold, or Removed.
Removed           Terminating     RM              The job was stopped by LoadLeveler.
Remove            Terminating     RP              The job is in the process of being
Pending                                           removed, but not all associated machines
                                                  have acknowledged the removal of the
                                                  job.
Resume Pending Running            MP              The job is in the process of being
                                                  resumed.
Running           Running         R               The job is running: the job was dispatched
                                                  and has started on the designated
                                                  machine.
Starting          Running         ST              The job is starting: the job was dispatched,
                                                  was received by the target machine, and
                                                  LoadLeveler is setting up the environment
                                                  in which to run the job. For a parallel job,
                                                  LoadLeveler sets up the environment on
                                                  all required nodes. See the description of
                                                  the “Pending” state for more information
                                                  on when the negotiator or the Schedd
                                                  daemon moves a job into the Starting
                                                  state.
System Hold       Idle            S                 The job has been put in system hold.
                            Terminated         Terminating    TX              If the negotiator and Schedd daemons
                                                                              experience communication problems, they
                                                                              may be temporarily unable to exchange
                                                                              information concerning the status of jobs
                                                                              in the system. During this period of time,
                                                                              some of the jobs may actually complete
                                                                              and therefore be removed from the
                                                                              Schedd’s list of active jobs. When
                                                                              communication resumes between the two
                                                                              daemons, the negotiator will move such
                                                                              jobs to the Terminated state, where they
                                                                              will remain for a set period of time
                                                                              (specified by the
                                                                              NEGOTIATOR_REMOVE_COMPLETED
                                                                              keyword in the configuration file). When
                                                                              this time has passed, the negotiator will
                                                                              remove the jobs from its active list.
                            User & System      Idle           HS              The job has been put in both system hold
                            Hold                                              and user hold.
                            User Hold          Idle           H               The job has been put in user hold.
                            Vacated            Idle           V               The job started but did not complete. The
                                                                              negotiator will reschedule the job
                                                                              (provided the job is allowed to be
                                                                              rescheduled). Possible reasons why a job
                                                                              moves to the Vacated state are: the
                                                                              machine where the job was running was
                                                                              flushed, the VACATE expression in the
                                                                              configuration file evaluated to True, or
                                                                              LoadLeveler detected a condition
                                                                              indicating the job needed to be vacated.
                                                                              For more information on the VACATE
                                                                              expression, see “Managing job status
                                                                              through control expressions” on page 68.
                            Vacate Pending     Idle           VP              The job is in the process of being vacated.



    Consumable resources
                            Consumable resources are assets available on machines in your LoadLeveler
                            cluster.

                            These assets are called “resources” because they model the commodities or services
|                           available on machines (including CPUs, real memory, virtual memory, large page
|                           memory, software licenses, disk space). They are considered “consumable” because
                            job steps use specified amounts of these commodities when the step is running.
                            Once the step finishes, the resource becomes available for another job step.

                            Consumable resources which model the characteristics of a specific machine (such
                            as the number of CPUs or the number of specific software licenses available only
                            on that machine) are called machine resources. Consumable resources which model
                            resources that are available across the LoadLeveler cluster (such as floating
                            software licenses) are called floating resources. For example, consider a
configuration with 10 licenses for a given program (which can be used on any
    machine in the cluster). If these licenses are defined as floating resources, all 10 can
    be used on one machine, or they can be spread across as many as 10 different
    machines.
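
     As a sketch of how such floating resources might be declared in the global
     configuration file (the resource name spice2g6 is illustrative; see
     Chapter 12, “Configuration file reference,” on page 263 to verify keyword
     syntax):

        FLOATING_RESOURCES    = spice2g6(10)
        SCHEDULE_BY_RESOURCES = ConsumableCpus spice2g6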

    The LoadLeveler administrator can specify:
    v Consumable resources to be considered by LoadLeveler’s scheduling algorithms
    v Quantity of resources available on specific machines
    v Quantity of floating resources available on machines in the cluster
    v Consumable resources to be considered in determining the priority of executing
      machines
    v Default amount of resources consumed by a job step of a specified job class
|   v Whether CPU, real memory, virtual memory, or large page resources should be
|     enforced using AIX Workload Manager (WLM)
    v Whether all jobs submitted need to specify resources

    Users submitting jobs can specify the resources consumed by each task of a job
    step, or the resources consumed by the job on each machine where it runs,
    regardless of the number of tasks assigned to that machine.
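
     For example, a job command file might include a line like the following to
     request one CPU and 200 MB of real memory for each task of the step (a
     hedged sketch; per-machine requests use the node_resources keyword, and the
     exact syntax is in the job command file keyword reference):

        # @ resources = ConsumableCpus(1) ConsumableMemory(200 mb)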

    If affinity scheduling support is enabled, the CPUs requested in the consumable
    resources requirement of a job will be used by the scheduler to determine the
    number of CPUs to be allocated and attached to that job’s tasks running on
    machines enabled for affinity scheduling. However, if the affinity scheduling
     request contains the processor-core affinity option, the number of CPUs will be
     determined from the value specified by the task_affinity keyword instead of the
     ConsumableCpus value in the consumable resources requirement. For more information on
    scheduling affinity, see “LoadLeveler scheduling affinity support” on page 146.

    Note:
            1. When software licenses are used as a consumable resource, LoadLeveler
               does not attempt to obtain software licenses or to verify that software
               licenses have been obtained. However, by providing a user exit that can
               be invoked as a submit filter, the LoadLeveler administrator can provide
               code to first obtain the required license and then allow the job step to
               run. For more information on filtering job scripts, see “Filtering a job
               script” on page 76.
|           2. LoadLeveler scheduling algorithms use the availability of requested
|              consumable resources to determine the machine or machines on which a
|              job will run. Consumable resources (except for CPU, real memory, virtual
|              memory and large page) are only used for scheduling purposes and are
|              not enforced. Instead, LoadLeveler’s negotiator daemon keeps track of
|              the consumable resources available by reducing them by the amount
|              requested when a job step is scheduled, and increasing them when a
|              consuming job step completes.
            3. If a job is preempted, the job continues to use all consumable resources
               except for ConsumableCpus and ConsumableMemory (real memory)
               which are made available to other jobs.
            4. When the network adapters on a machine support RDMA, the machine
               is automatically given a consumable resource called RDMA with an
               available quantity defined by the limit on the number of concurrent jobs
               that use RDMA. For machines with the “Switch Network Interface for
               HPS” network adapters, this limit is 4. Machines with InfiniBand
               adapters are given unlimited RDMA resources.
5. When steps require RDMA, either because they request bulkxfer or
                                       because they request rcxtblocks on at least one network statement, the
                                       job is automatically given a resource requirement for 1 RDMA.

                 Consumable resources and AIX Workload Manager
|                           If the administrator has indicated that resources should be enforced, LoadLeveler
|                           uses AIX Workload Manager (WLM) to give greater control over CPU, real
|                           memory, virtual memory and large page resource allocation.

                            WLM monitors system resources and regulates their allocation to processes
                            running on AIX. These actions prevent jobs from interfering with each other when
                            they have conflicting resource requirements. WLM achieves this control by creating
                            different classes of service and allowing attributes to be specified for those classes.

                            LoadLeveler dynamically generates WLM classes with specific resource
                            entitlements. A single WLM class is created for each job step and the process ID of
                            that job step is assigned to that class. This is done for each node that a job step is
                            assigned to run on. LoadLeveler then defines resource shares or limits for that class
                            depending on the LoadLeveler enforcement policy defined. These resource shares
                            or limits represent the job’s requested resource usage in relation to the amount of
                            resources available on the machine.

|                           When LoadLeveler defines multiple memory resources under one WLM class, AIX
|                           WLM uses the following order to determine if resource limits have been exceeded:
                            1. Real Memory Absolute Limit
                            2. Virtual Memory Absolute Limit
                            3. Large Page Limit
|                           4. Real Memory shares or percent limit

|                           Note: When a real memory or CPU shares or percent limit is exceeded, the
|                                 job processes in that class receive a lower scheduling priority
|                                 until their utilization drops below the hard max limit. When virtual memory
|                                 or absolute real memory limits are exceeded, the job processes are killed.
|                                 When the large page limit is exceeded, any new large page requests are
|                                 denied.

                            When the enforcement policy is shares, LoadLeveler assigns a share value to the
                            class based on the resources requested for the job step (one unit of resource equals
                            one share). When the job step process is running, AIX WLM dynamically calculates
                            an appropriate resource entitlement based on the WLM class share value of the job
                            step and the total number of shares requested by all active WLM classes. It is
                            important to note that AIX WLM will only enforce these target percentages when
                            the resource is under contention.

                            When the enforcement policy is limits (soft or hard), LoadLeveler assigns a
                            percentage value to the class based on the resources requested for the job step and
                            the total machine resources. This resource percentage is enforced regardless of any
                            other active WLM classes. A soft limit indicates the maximum amount of the
                            resource that can be made available when there is contention for the resources.
                            This maximum can be exceeded if no one else requires the resource. A hard limit
                            indicates the maximum amount of the resource that can be made available even if
                            there is no contention for the resources.
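
                            As a minimal sketch, an administrator might enable enforcement of CPU
                            and real memory with a hard limits policy using configuration lines
                            like the following (see Chapter 12, “Configuration file reference,” on
                            page 263 to verify keyword details):

                               ENFORCE_RESOURCE_USAGE  = ConsumableCpus ConsumableMemory
                               ENFORCE_RESOURCE_POLICY = hard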



|                 Note: A WLM class is active for the duration of a job step and is deleted when the
|                       job step completes. There is a limit of 64 active WLM classes per machine.
|                       Therefore, when resources are being enforced, only 64 job steps can be
|                       running on one machine.

                  For additional information about integrating LoadLeveler with AIX Workload
                  Manager, see “Steps for integrating LoadLeveler with the AIX Workload Manager”
                  on page 137.

    Overview of reservations
                  Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
                  reservations, which specify a time period during which specific node resources are
                  reserved for exclusive use by particular users or groups. This capability is known
                  in the computing industry as advance reservation.

                  Normally, jobs wait to be dispatched until the resources they require become
                  available. Through the use of reservations, wait time can be reduced because the
                  jobs have exclusive use of the node resources (CPUs, memory, disk drives,
                  communication adapters, and so on) as soon as the reservation period begins.

                  Note: Advance reservation supports Blue Gene resources including the Blue Gene
                        compute nodes. For more information, see “Blue Gene reservation support”
                        on page 159.

                  In addition to reducing wait time, reservations also are useful for:
                  v Running a workload that needs to start or finish at a particular time. The job
                    steps must be associated with, or bound to, the reservation before LoadLeveler
                    can run them during the reservation period.
|                 v Reserving resources for a workload that repeats at regular intervals. You can
|                   make a single request to create a recurring reservation, which reserves a specific
|                   set of resources for a specific time slot that repeats on a regular basis for a
|                   defined interval.
                  v Setting aside a set of nodes for maintenance purposes. In this case, job steps are
                    not bound to the reservation.
                  Only bound job steps may run on the reserved nodes, which means that a bound
                  job step competes for reserved resources only with other job steps that are bound
                  to the same reservation.

                  The following sequence of events describes, in general terms, how you can set up
                  and use reservations in the LoadLeveler environment. It also describes how
                  LoadLeveler manages activities related to the use of reservations.
                  1. Configuring LoadLeveler to support reservations
                     An administrator uses specific keywords in the configuration and
                     administration files to define general reservation policies. These keywords
                     include:
|                    v max_reservations, when used in the global configuration file, defines the
|                       maximum number of reservations for the entire cluster.
|                    v max_reservations, when used in a user or group stanza of the administration
|                       file, can also be used to define both:
                        – The users or groups that will be allowed to create reservations. To be
                            authorized to create reservations, LoadLeveler administrators also must
                            have the max_reservations keyword set in their own user or group
                            stanzas.

– How many reservations users may own.

|                                Note: With recurring advance reservations, to avoid confusion about what
|                                        counts as one reservation, LoadLeveler counts each reservation as one
|                                        instance, regardless of the number of times the reservation recurs
|                                        before it expires. This applies to the system-wide max_reservations
|                                        configuration setting as well as to the same type of configuration
|                                        settings at the user and group levels.
                               v max_reservation_duration, which defines the maximum duration for
                                 reservations.
                               v reservation_permitted, which defines the nodes that may be used for
                                 reservations.
|                              v max_reservation_expiration, which defines how long recurring reservations
|                                are permitted to last (expressed as the number of days).
                               Administrators also may configure LoadLeveler to collect accounting data
                               about reservations when the reservations complete or are canceled.
                            2. Creating reservations
                               After LoadLeveler is configured for reservations, an administrator or
                               authorized user may create specific reservations, defining reservation attributes
                               that include:
                               v The start time and the duration of the reservation. The start and end times
                                 for a reservation are based on the time-of-day (TOD) clock on the central
                                 manager machine.
|                              v Whether or not the reservation recurs and if it recurs, the interval in which it
|                                does so.
                               v The nodes to be reserved. Until the reservation period actually begins, the
                                 selected nodes are available to run any jobs; when the reservation starts, only
                                 jobs bound to the reservation may run on the reserved nodes.
                               v The users or groups that may use the reservation.
                               LoadLeveler assigns a unique ID to the reservation, and returns that ID to the
                               owner.
                               After the reservation is successfully created:
                               v Reservation owners may:
                                 – Modify, query, and cancel their reservations.
                                 – Allow other LoadLeveler users or groups to submit jobs to run during a
                                    reservation period.
                                 – Submit jobs to run during a reservation period.
                               v Users or groups that are allowed to use the reservation also may query
                                 reservations, and submit jobs to run during a reservation period. To run jobs
                                 during a reservation period, users must bind job steps to the reservation. You
                                 may bind both batch and interactive POE job steps to a reservation.
                            3. Preparing for the start of a reservation
                               During the preparation time for a reservation, LoadLeveler:
                               v Preempts any jobs that are still running on the reserved nodes.
                               v Checks the condition of reserved nodes, and notifies the reservation owner
                                 and LoadLeveler administrators by e-mail of any situations that might
                                 require the reservation owner or an administrator to take corrective action.
                                 Such conditions include:
                                 – Reserved nodes that are down, suspended, no longer in the LoadLeveler
                                    cluster, or otherwise unavailable for use.
                                 – Non-preemptable job steps that cannot finish running before the
                                    reservation start time.


During this time, reservation owners may modify, cancel, and add users or
                     groups to their reservations. Owners and users or groups that are allowed to
                     use the reservation may query the reservation or bind job steps to it.
                  4. Starting the reservation
                     When the reservation period begins, LoadLeveler dispatches job steps that are
                     bound to the reservation.
                     After the reservation period begins, reservation owners may modify, cancel,
                     and add users or groups to their reservations. Owners and users or groups that
                     are allowed to use the reservation may query the reservation or bind job steps
                     to it.
                     During the reservation period, LoadLeveler ignores system preemption rules
                     for bound job steps; however, LoadLeveler administrators may use the
                     llpreempt command to manually preempt bound job steps.

                  When the reservation ends or is canceled:
|                 v LoadLeveler unbinds all job steps from the reservation if there are no further
|                   occurrences remaining. At this point the unbound job steps compete with all
|                   other LoadLeveler jobs for available resources. If there are occurrences remaining
|                   in the reservation, job steps are automatically bound to the next occurrence.
                  v If accounting data is being collected for the reservation, LoadLeveler also
                    updates the reservation history file.
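
                   As an illustration of creating and querying reservations (step 2 above),
                   an authorized user might enter commands like the following (the flags
                   shown are a sketch; verify them in the command reference before use):

                      llmkres -t 11/20 10:00 -d 60 -n 4
                      llqres

                   Here llmkres requests a reservation of 4 nodes for 60 minutes starting
                   at 10:00 on November 20, and llqres displays existing reservations. The
                   llbind command binds submitted job steps to the reservation ID that
                   llmkres returns.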

                  For more detailed information and instructions for setting up and using
                  reservations, see:
                  v “Configuring LoadLeveler to support reservations” on page 131.
                  v “Working with reservations” on page 213.

    Fair share scheduling overview
                  Fair share scheduling in LoadLeveler provides a way to divide resources in a
                  LoadLeveler cluster among users or groups of users.

                  Historic resource usage data that is collected at the time the job ends can be used
                  to influence job priorities to achieve the resource usage proportions allocated to
                  users or groups of users in the LoadLeveler configuration files. The resource usage
                  data will decay over time so that the relatively recent historic resource usage will
                  have the most influence on job priorities. The CPU resources in the cluster and the
                  Blue Gene resources are currently supported by fair share scheduling.

                  For information about configuring fair share scheduling in LoadLeveler, see “Using
                  fair share scheduling” on page 160.
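
                   As a brief sketch (keyword names are taken from that discussion; verify
                   them there), an administrator might set, in the global configuration
                   file:

                      FAIR_SHARE_TOTAL_SHARES = 100

                   and, in a group stanza of the administration file:

                      dept_a: type = group
                      fair_shares = 60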




Chapter 2. Getting a quick start using the default
configuration
               If you are very familiar with UNIX and Linux system administration and job
               scheduling, follow these steps to get LoadLeveler up and running on your network
               quickly in a default configuration.

               This default configuration will merely enable you to submit serial jobs; for a more
               complex setup, see Chapter 4, “Configuring the LoadLeveler environment,” on
               page 41.

What you need to know before you begin
               LoadLeveler sets up default values for configuration information.
               v loadl is the recommended LoadLeveler user ID and the LoadLeveler group ID.
                 LoadLeveler daemons run under this user ID to perform file I/O, and many
                 LoadLeveler files are owned by this user ID.
               v The home directory of loadl is the configuration directory.
               v LoadL_config is the name of the configuration file.

               For information about configuration file keyword syntax and other details, see
               Chapter 12, “Configuration file reference,” on page 263.

Using the default configuration files
               Follow these steps to use the default configuration files.

               Note: You can find samples of the LoadL_admin and LoadL_config files in the
                     release directory (in the samples subdirectory).
               1. Ensure that the installation procedure has completed successfully and that the
                  configuration file, LoadL_config, exists in LoadLeveler’s home directory or in
                  the directory specified by the LoadLConfig keyword.
               2. Identify yourself as the LoadLeveler administrator in the LoadL_config file
                  using the LOADL_ADMIN keyword. The syntax of this keyword is:
                  LOADL_ADMIN = list_of_user_names (required)
                    Where list_of_user_names is a blank-delimited list of those individuals who
                    will have administrative authority.

                  Refer to “Defining LoadLeveler administrators” on page 43 for more
                  information.
               3. Define a machine to act as the LoadLeveler central manager by coding one
                  machine stanza as follows in the administration file, which is called
                  LoadL_admin. (Replace machine_name with the actual name of the machine.)
                   machine_name: type = machine
                   central_manager = true
                  Do not specify more than one machine as the central manager. Also, if during
                  installation, you ran llinit with the -cm flag, the central manager is already
                  defined in the LoadL_admin file because the llinit command takes parameters
                  that you entered and updates the administration and configuration files. See
                  “Defining machines” on page 84 for more information.
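
                   Taken together, a minimal sketch of these two edits (the machine name
                   is illustrative) looks like this:

                   In LoadL_config:
                      LOADL_ADMIN = loadl

                   In LoadL_admin:
                      mymachine.example.com: type = machine
                      central_manager = true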

LoadLeveler for Linux quick start
                            If you would like to quickly install and configure LoadLeveler for Linux and
                            submit a serial job on a single node, use these procedures.

                            Note: This setup is for a single node only and the node used for this example is:
                                  c197blade1b05.ppd.pok.ibm.com.

                 Quick installation
                            Details of this installation apply to RHEL 4 System x servers.

                            Note: This installation method is, however, applicable to all other systems. You
                                  must install the corresponding license RPM for the system you are installing
                                  on. This installation assumes that the LoadLeveler RPMs are located at:
                                  /mnt/cdrom/.
                            1. Log on to node c197blade1b05.ppd.pok.ibm.com as root, which is the node you
                               are installing on.
                            2. Add a UNIX group for LoadLeveler users (make sure the group ID is correct)
                               by entering the following command:
                                groupadd -g 1000 loadl
                            3. Add a UNIX user for LoadLeveler (make sure the user ID is correct) by
                               entering the following command:
                                useradd -c "LoadLeveler User" -d /home/loadl -s /bin/bash -u 1001 -g 1000 -m loadl
|                           4. Install the license RPM by entering the following command:
|                               rpm -ivh /mnt/cdrom/LoadL-full-license-RH4-X86-3.5.0.0-0.i386.rpm
                            5. Change to the LoadLeveler installation path by entering the following
                               command:
                                cd /opt/ibmll/LoadL/sbin
                            6. Run the LoadLeveler installation script by entering:
                                ./install_ll -y -d /mnt/cdrom
|                           7. Install the required LoadLeveler service updates for 3.5.0.1 for this RPM.
|                              Updates and installation instructions are available at:
|                              https://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/
|                              intel.html

                 Quick configuration
                            Use this method to perform a quick configuration.
                            1. Switch to the newly created LoadLeveler user by entering the
                               following command:
                                su - loadl
                            2. Add the LoadLeveler bin directory to the search path:
                                export PATH=$PATH:/opt/ibmll/LoadL/full/bin
                            3. Run the LoadLeveler initialization script:
                                /opt/ibmll/LoadL/full/bin/llinit -local /tmp/loadl -release /opt/ibmll/LoadL/full -cm
                                c197blade1b05.ppd.pok.ibm.com


                 Quick verification
                            Use this method to perform a quick verification.
|                           1. Start LoadLeveler by entering the following command:
|                               llctl start


|                     You should receive a response similar to the following:
|                     llctl: Attempting to start LoadLeveler on host c197blade1b05.ppd.pok.ibm.com
|                     LoadL_master 3.5.0.1 rsats001a 2008/10/29 RHEL 4.0 140
|                     CentralManager = c197blade1b05.ppd.pok.ibm.com
|                     [loadl@c197blade1b05 bin]$
                   2. Check LoadLeveler status by entering the following command:
                      llstatus
                      You should receive a response similar to the following:
                      Name                      Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
                      c197blade1b05.ppd.pok.ibm Avail 0     0 Idle 0 0.00 1          i386 Linux2
                      i386/Linux2                 1 machines     0 jobs       0 running task
                      Total Machines              1 machines     0 jobs       0 running task

                      The central manager is defined on c197blade1b05.ppd.pok.ibm.com

                      The BACKFILL scheduler is in use

                      All machines on the machine_list are present.
                      [loadl@c197blade1b05 bin]$
                   3. Submit a sample job, by entering the following command:
                      llsubmit /opt/ibmll/LoadL/full/samples/job1.cmd
                      You should receive a response similar to the following:
                      llsubmit: The job "c197blade1b05.ppd.pok.ibm.com.1" with 2 job steps /
                       has been submitted.
                      [loadl@c197blade1b05 samples]$
                   4. Display the LoadLeveler job queue, by entering the following command:
                      llq
                      You should receive a response similar to the following:
                      Id                       Owner      Submitted   ST PRI Class        Running On
                      ------------------------ ---------- ----------- -- --- ------------ -----------
                      c197blade1b05.1.0       loadl       8/15 17:25 R 50 No_Class       c197blade1b05
                      c197blade1b05.1.1       loadl       8/15 17:25 I 50 No_Class
                      2 job step(s) in queue, 1 waiting, 0 pending, 1 running, 0 held, 0 preempted
                      [loadl@c197blade1b05 samples]$
                    5. Check the output files in the home directory (/home/loadl) by entering the
                      following command:
                      ls -ltr job*
                      You should receive a response similar to the following:
                      -rw-rw-r-- 1 loadl loadl 1940 Aug 15 17:26 job1.c197blade1b05.1.0.out
                      -rw-rw-rw- 1 loadl loadl 1940 Aug 15 17:27 job1.c197blade1b05.1.1.out
                      [loadl@c197blade1b05 ~]$


    Post-installation considerations
                   This information explains how to start (or restart) and stop LoadLeveler. It also
                   tells you where files are located after you install LoadLeveler, and it points you to
                   troubleshooting information.

            Starting LoadLeveler
                   You can start LoadLeveler using any LoadLeveler administrator user ID as defined
                   in the configuration file.

                   To start all of the machines that are defined in machine stanzas in the
                   administration file, enter:
                   llctl -g start



The central manager machine is the first started, followed by other machines in the
                        order listed in the administration file. See “llctl - Control LoadLeveler daemons”
                        on page 439 for more information.

                        By default, llctl uses rsh to start LoadLeveler on the target machine. Other
                        mechanisms, such as ssh, can be used by setting the LL_RSH_COMMAND
                        configuration keyword in LoadL_config. However you choose to start LoadLeveler
                        on remote hosts, you must have the authority to run commands remotely on those
                        hosts.
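
                        For example, to use ssh instead of rsh, you might add a line such as
                        the following to LoadL_config (the path shown is typical, but verify
                        it on your system):

                           LL_RSH_COMMAND = /usr/bin/ssh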

                        You can verify that the machine has been properly configured by running the
                        sample jobs in the appropriate samples directory (job1.cmd, job2.cmd, and
                        job3.cmd). You must read the job2.cmd and job3.cmd files before submitting them
                        because job2 must be edited and a C program must be compiled to use job3. It is a
                        good idea to copy the sample jobs to another directory before modifying them; you
                        must have read/write permission to the directory in which they are located. You
                        can use the llsubmit command to submit the sample jobs from several different
                        machines and verify that they complete (see “llsubmit - Submit a job” on page
                        531).

                        If you are running AFS and some jobs do not complete, you might need to use the
                        AFS fs command (fs listacl) to ensure that you have write permission to the
                        spool, execute, and log directories.

                        If you are running with cluster security services enabled and some jobs do not
                        complete, ensure that you have write permission to the spool, execute, and log
                        directories. Also ensure that the user ID is authorized to run jobs on the submitting
                        machine (the identity of the user must exist in the .rhosts file of the user on the
                        machine on which the job is being run).

                        Note: LoadLeveler for Linux does not support cluster security services.

                        If you are running submit-only LoadLeveler, once the LoadLeveler pool is up and
                        running, you can use the llsubmit, llq, and llcancel commands from the
                        submit-only machines. For more information about these commands, see:
                        v “llsubmit - Submit a job” on page 531
                        v “llq - Query job status” on page 479
                        v “llcancel - Cancel a submitted job” on page 421

                        You can also invoke the LoadLeveler graphical user interface xloadl_so from the
                        submit-only machines (see Chapter 15, “Graphical user interface (GUI) reference,”
                        on page 403).

             Location of directories following installation
                         After installation, the product directories reside on disk.

                         The product directories are shown in Table 7 on page 33. The installation process
                         creates only those directories required to service the LoadLeveler options
                         specified during the installation. For AIX, release_directory indicates
                         /usr/lpp/LoadL/full; for Linux, it indicates /opt/ibmll/LoadL/full.




Table 7. Location and description of product directories following installation
Directory                                         Description
release_directory/bin                             Part of the release directory containing
                                                  daemons, commands, and other binaries
release_directory/lib                             Part of the release directory containing
                                                  product libraries and resource files
release_directory/man                             Part of the release directory containing man
                                                  pages
release_directory/samples                         Part of the release directory containing
                                                  sample administration and configuration files
                                                  and sample jobs
release_directory/include                         Part of the release directory containing
                                                  header files for the application programming
                                                  interfaces
Local directory                                   spool, execute, and log directories for each
                                                  machine in the cluster
Home directory                                    Administration and configuration files, and
                                                  symbolic links to the release directory
/usr/lpp/LoadL/codebase                           Configuration tasks for AIX


Table 8 shows the location of directories for submit-only LoadLeveler:
Table 8. Location and description of directories for submit-only LoadLeveler
Directory                                         Description
release_directory/so/bin                          Part of the release directory containing
                                                  commands
release_directory/so/man                          Part of the release directory containing man
                                                  pages
release_directory/so/samples                      Part of the release directory containing
                                                  sample administration and configuration files
release_directory/so/lib                          Contains libraries and graphical user
                                                  interface resource files
Home directory                                    Contains administration and configuration
                                                  files


If you have a mixed LoadLeveler cluster of AIX and Linux machines, you might
want to make the following symbolic links:
v On AIX, as root, enter:
  mkdir -p /opt/ibmll
  ln -s /usr/lpp/LoadL /opt/ibmll/LoadL
v On Linux, as root, enter:
  mkdir -p /usr/lpp
  ln -s /opt/ibmll/LoadL /usr/lpp/LoadL

With the addition of these symbolic links, a user application can use either
/usr/lpp/LoadL or /opt/ibmll/LoadL to refer to the location of LoadLeveler files
regardless of whether the application is running on AIX or Linux.

If LoadLeveler will not start following installation, see “Why won’t LoadLeveler
start?” on page 700 for troubleshooting information.

Chapter 3. What operating systems are supported by
    LoadLeveler?
                  LoadLeveler supports the following operating systems and platforms:
|                 v AIX 6.1 and AIX 5.3
|                   IBM’s AIX 6.1 and AIX 5.3 are open UNIX operating environments that conform
|                   to The Open Group UNIX 98 Base Brand industry standard. AIX 6.1 and AIX 5.3
|                   provide high levels of integration, flexibility, and reliability and operate on IBM
|                   Power Systems and IBM Cluster 1600 servers and workstations.
|                   AIX 6.1 and AIX 5.3 support the concurrent operation of 32- and 64-bit
|                   applications, with key internet technologies such as Java™ and XML parser for
|                   Java included as part of the base operating system.
|                   A strong affinity between AIX and Linux permits popular applications
|                   developed on Linux to run on AIX 6.1 and AIX 5.3 with a simple recompilation.
|                 v Linux
|                   LoadLeveler supports the following distributions of Linux:
|                   – Red Hat® Enterprise Linux (RHEL) 4 and RHEL 5
|                   – SUSE Linux Enterprise Server (SLES) 9 and SLES 10
                  v IBM System Blue Gene Solution
                    While no LoadLeveler processes actually run on the Blue Gene machine,
                    LoadLeveler can interact with the Blue Gene machine and supports the
                    scheduling of jobs to the machine.

                    Note: For models of the Blue Gene system such as Blue Gene/S, which can only
                          run a single job at a time, LoadLeveler does not have to be configured to
                          schedule resources for Blue Gene jobs. For such systems, serial jobs can be
                          used to submit work to the front end node for the Blue Gene system.

    LoadLeveler for AIX and LoadLeveler for Linux compatibility
                  LoadLeveler for Linux is compatible with LoadLeveler for AIX. Its command line
                  interfaces, graphical user interfaces, and application programming interfaces (APIs)
                  are the same as those of LoadLeveler for AIX. The formats of the job command file,
                  configuration file, and administration file also remain the same.

                  System administrators can set up and maintain a LoadLeveler cluster consisting of
                  some machines running LoadLeveler for AIX and some machines running
                  LoadLeveler for Linux. This is called a mixed cluster. In this mixed cluster jobs can
                  be submitted from either AIX or Linux machines. Jobs submitted to a Linux job
                  queue can be dispatched to an AIX machine for execution, and jobs submitted to
                  an AIX job queue can be dispatched to a Linux machine for execution.

                  Although the LoadLeveler products for AIX and Linux are compatible, they do
                  have some differences in the level of support for specific features. For further
                  details, see the following topics:
                  v “Restrictions for LoadLeveler for Linux” on page 36.
                  v “Features not supported in LoadLeveler for Linux” on page 36.
                  v “Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters”
                    on page 37.


Restrictions for LoadLeveler for Linux
                            LoadLeveler for Linux supports a subset of the features that are available in the
                            LoadLeveler for AIX product.

                            The following features are available, but are subject to restrictions:
                            v 32-bit applications using the LoadLeveler APIs
                              LoadLeveler for Linux supports only the 32-bit LoadLeveler API library
                              (libllapi.so) on the following platforms:
                              – RHEL 4 and RHEL 5 on IBM IA-32 xSeries® servers
                              – SLES 9 and SLES 10 on IBM IA-32 xSeries servers
                              Applications linked to the LoadLeveler APIs on these platforms must be 32-bit
                              applications.
                             v 64-bit applications using the LoadLeveler APIs
                              LoadLeveler for Linux supports only the 64-bit LoadLeveler API library
                              (libllapi.so) on the following platforms:
                              – RHEL 4 and RHEL 5 on IBM xSeries servers with AMD Opteron or Intel
                                  EM64T processors
                              – RHEL 4 and RHEL 5 on POWER™ servers
                              – SLES 9 and SLES 10 on IBM xSeries servers with AMD Opteron or Intel
                                  EM64T processors
                              – SLES 9 and SLES 10 on POWER servers
                              Applications linked to the LoadLeveler APIs on these platforms must be 64-bit
                              applications.
                            v Support for AFS file systems
                              LoadLeveler for Linux support for authenticated access to AFS file systems is
                              limited to RHEL 4 on xSeries servers and IBM xSeries servers with AMD
                              Opteron or Intel EM64T processors. It is not available on systems running SLES
                              9 or SLES 10.

                 Features not supported in LoadLeveler for Linux
                            LoadLeveler for Linux supports a subset of the features that are available in the
                            LoadLeveler for AIX product.

                            The following features are not supported:
                            v RDMA consumable resource
                              On systems with High Performance Switch adapters, RDMA consumable
                              resources are not supported on LoadLeveler for Linux.
                            v User context RDMA blocks
                              User context RDMA blocks are not supported by LoadLeveler for Linux.
                            v Checkpoint/restart
                              LoadLeveler for AIX uses a number of features that are specific to the AIX
                              kernel to provide support for checkpoint/restart of user applications running
                              under LoadLeveler. Checkpoint/restart is not available in this release of
                              LoadLeveler for Linux.
                            v AIX Workload management (WLM)
                              WLM can strictly control use of system resources. LoadLeveler for AIX uses
                              WLM to enforce the use of a number of consumable resources defined by
|                             LoadLeveler (such as ConsumableCpus, ConsumableVirtualMemory,
|                             ConsumableLargePageMemory, and ConsumableMemory). This enforcement
              of consumable resource usage through WLM is not available in this release of
            LoadLeveler for Linux.
          v CtSec security
            LoadLeveler for AIX can exploit CtSec (Cluster Security Services) security
            functions. These functions authenticate the identity of users and programs
            interacting with LoadLeveler. These features are not available in this release of
            LoadLeveler for Linux.
          v LoadL_GSmonitor daemon
            The LoadL_GSmonitor daemon in the LoadLeveler for AIX product uses the
            Group Services Application Programming Interface (GSAPI) to monitor machine
            availability and notify the LoadLeveler central manager when a machine is no
            longer reachable. This daemon is not available in the LoadLeveler for Linux
            product.
          v Task guide tool
          v System error log
             Each LoadLeveler daemon has its own log file where information relevant to its
             operation is recorded. In addition to this feature, which exists on all platforms,
             LoadLeveler for AIX also uses the errlog function to record critical LoadLeveler
             events into the AIX system log. Support for an equivalent Linux function is not
             available in this release.

    Restrictions for LoadLeveler for AIX and LoadLeveler for
    Linux mixed clusters
|         Several restrictions apply when operating a LoadLeveler cluster that contains AIX
|         6.1, AIX 5.3, and Linux machines.

|         When operating a LoadLeveler cluster that contains AIX 6.1, AIX 5.3, and Linux
          machines, the following restrictions apply:
          v The central manager node must run a version of LoadLeveler equal to or higher
            than any LoadLeveler version being run on a node in the cluster.
          v CtSec security features cannot be used.
          v AIX jobs that use checkpointing must be sent to AIX nodes for execution. You
            can do this either by defining job checkpointing for job classes that exist only on
            AIX nodes or by coding appropriate requirements expressions (see the example
            after this list). Checkpointing jobs that are sent to a Linux node will be rejected
            by the LoadL_startd daemon running on the Linux node.
          v WLM is supported in a mixed cluster. However, enforcement of the use of
            consumable resources will occur through WLM on AIX nodes only.
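
          For example, a job command file might direct a checkpointing job to AIX nodes
          with a requirements expression similar to the following sketch (the OpSys value
          shown is illustrative; use the value that your AIX nodes report):
          # @ requirements = (OpSys == "AIX53")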




Part 2. Configuring and managing the TWS LoadLeveler
environment
            After installing IBM Tivoli Workload Scheduler (TWS) LoadLeveler, you may
            customize it by modifying both the configuration file and the administration file
            (see Part 1, “Overview of TWS LoadLeveler concepts and operation,” on page 1 for
            overview information). The configuration file contains many parameters that you
            can set or modify that will control how TWS LoadLeveler operates. The
            administration file optionally lists and defines the machines in the TWS
            LoadLeveler cluster and the characteristics of classes, users, and groups.

            To easily manage TWS LoadLeveler, you should have one global configuration file
            and only one administration file, both centrally located on a machine in the TWS
            LoadLeveler cluster. Every other machine in the cluster must be able to read the
             configuration and administration files that are located on the central machine.

            You may have multiple local configuration files that specify information specific to
            individual machines.

            TWS LoadLeveler does not prevent you from having multiple copies of
            administration files, but you need to be sure to update all the copies whenever you
            make a change to one. Having only one administration file prevents any confusion.




Chapter 4. Configuring the LoadLeveler environment
            One of your main tasks as system administrator is to configure LoadLeveler.

            To configure LoadLeveler, you need to know what the configuration information is
            and where it is located. Configuration information includes the following:
            v The LoadLeveler user ID and group ID
            v The configuration directory
            v The global configuration file

            Configuring LoadLeveler involves modifying the configuration files that specify
            the terms under which LoadLeveler can use machines. There are two types of
            configuration files:
            v Global Configuration File: This file by default is called the LoadL_config file and it
              contains configuration information common to all nodes in the LoadLeveler
              cluster.
            v Local Configuration File: This file is generally called LoadL_config.local (although
              it is possible for you to rename it). This file contains specific configuration
              information for an individual node. The LoadL_config.local file is in the same
              format as LoadL_config and the information in this file overrides any
              information specified in LoadL_config. It is an optional file that you use to
              modify information on a local machine. Its full path name is specified in the
              LoadL_config file by using the LOCAL_CONFIG keyword. See “Specifying file
              and directory locations” on page 47 for more information.
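
             For example, the global configuration file might point each machine to a local file
             kept in the LoadLeveler home directory. The path shown here is illustrative (in the
             sample configuration files, $(tilde) expands to the LoadLeveler home directory):
             LOCAL_CONFIG = $(tilde)/LoadL_config.local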
            Table 9 identifies where you can find more information about using configuration
            and administration files to modify the TWS LoadLeveler environment.
            Table 9. Roadmap of tasks for TWS LoadLeveler administrators
            To learn about:                          Read the following:
            Controlling how TWS LoadLeveler          Chapter 4, “Configuring the LoadLeveler
            operates by customizing the global or    environment”
            local configuration file
            Controlling TWS LoadLeveler resources    Chapter 5, “Defining LoadLeveler resources to
            by customizing an administration file    administer,” on page 83
            Additional ways to modify TWS            Chapter 6, “Performing additional administrator
            LoadLeveler that require customization   tasks,” on page 103
            of both the configuration and
            administration files
             Ways to control or monitor TWS           v Chapter 16, “Commands,” on page 411
             LoadLeveler operations by using the      v Chapter 7, “Using LoadLeveler’s GUI to
             TWS LoadLeveler commands, GUI,             perform administrator tasks,” on page 169
             and APIs                                 v Chapter 17, “Application programming
                                                        interfaces (APIs),” on page 541


            You can run your installation with default values set by LoadLeveler, or you can
            change any or all of them. Table 10 on page 42 lists topics that discuss how you
            may configure the LoadLeveler environment by modifying the configuration file.




Table 10. Roadmap of administrator tasks related to using or modifying the LoadLeveler
                        configuration file
                        To learn about:                Read the following:
                        Using the default              Chapter 2, “Getting a quick start using the default
                        configuration files shipped    configuration,” on page 29
                        with LoadLeveler
                        Modifying the global and       “Modifying a configuration file”
                        local configuration files
                        Defining major elements of    v “Defining LoadLeveler administrators” on page 43
                        the LoadLeveler configuration
                                                      v “Defining a LoadLeveler cluster” on page 44
                                                       v “Defining LoadLeveler machine characteristics” on page
                                                         54
                                                       v “Defining security mechanisms” on page 56
                                                       v “Defining usage policies for consumable resources” on
                                                         page 60
                                                       v “Steps for configuring a LoadLeveler multicluster” on
                                                         page 151
                        Enabling optional              v “Enabling support for bulk data transfer and rCxt blocks”
                        LoadLeveler functions            on page 61
                                                       v “Gathering job accounting data” on page 61
                                                       v “Managing job status through control expressions” on
                                                         page 68
                                                       v “Tracking job processes” on page 70
                                                       v “Querying multiple LoadLeveler clusters” on page 71
                        Modifying LoadLeveler          “Providing additional job-processing controls through
                        operations through             installation exits” on page 72
                        installation exits



Modifying a configuration file
                        By taking a look at the configuration files that come with LoadLeveler, you will
                        find that there are many parameters that you can set. In most cases, you will only
                        have to modify a few of these parameters.

                        In some cases, though, depending upon the LoadLeveler nodes, network
                        connection, and hardware availability, you may need to modify additional
                        parameters.

                        All LoadLeveler commands, daemons, and processes read the administration and
                        configuration files at start up time. If you change the administration or
                        configuration files after LoadLeveler has already started, any LoadLeveler
                        command or process, such as the LoadL_starter process, will read the newer
                        version of the files while the running daemons will continue to use the data from
                        the older version. To ensure that all LoadLeveler commands, daemons, and
                        processes use the same configuration data, run the reconfiguration command on all
                        machines in the cluster each time the administration or configuration files are
                        changed.

                        To override the default user ID, group ID, or configuration file location, you must
                        update the following keywords in the /etc/LoadL.cfg file:
                        LoadLUserid
                                Specifies the LoadLeveler user ID.

LoadLGroupid
                    Specifies the LoadLeveler group ID.
              LoadLConfig
                    Specifies the full path name of the configuration file.
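
               For example, a minimal /etc/LoadL.cfg might contain the following entries (the
               values shown are illustrative; adjust them for your installation):
               LoadLUserid  = loadl
               LoadLGroupid = loadl
               LoadLConfig  = /home/loadl/LoadL_config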

              Note that if you change the LoadLeveler user ID to something other than loadl,
              you will have to make sure your configuration files are owned by this ID.

               If Cluster Security (CtSec) services are enabled, make sure you update the unix.map
               file if the LoadLUserid is specified as something other than loadl. Refer to “Steps
               for enabling CtSec services” on page 58 for more details.

              You can also override the /etc/LoadL.cfg file. For an example of when you might
              want to do this, see “Querying multiple LoadLeveler clusters” on page 71.

              Before you modify a configuration file, you need to:
              v Ensure that the installation procedure has completed successfully and that the
                configuration file, LoadL_config, exists in LoadLeveler’s home directory or in
                the directory specified in /etc/LoadL.cfg. For additional details about installation,
                see TWS LoadLeveler: Installation Guide.
              v Know how to correctly specify keywords in the configuration file. For
                information about configuration file keyword syntax and other details, see
                Chapter 12, “Configuration file reference,” on page 263.
              v Identify yourself as the LoadLeveler administrator using the LOADL_ADMIN
                keyword.

              After you finish modifying the configuration file, notify LoadLeveler daemons by
              issuing the llctl command with either the reconfig or recycle keyword. Otherwise,
              LoadLeveler will not process the modifications you made to the configuration file.
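
               For example, to have all machines in the cluster reread the modified files, you
               might issue:
               llctl -g reconfig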

Defining LoadLeveler administrators
              Specify the LOADL_ADMIN keyword with a list of user names of those
              individuals who will have administrative authority.
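
               For example, the following statement (the user names are hypothetical) grants
               administrative authority to the loadl ID and two other users:
               LOADL_ADMIN = loadl anna bob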

              These users are able to invoke the administrator-only commands such as llctl,
              llfavorjob, and llfavoruser. These administrators can also invoke the
              administrator-only GUI functions. For more information, see Chapter 7, “Using
              LoadLeveler’s GUI to perform administrator tasks,” on page 169.

              LoadLeveler administrators on this list also receive mail describing problems that
              are encountered by the master daemon. When CtSec is enabled, the
              LOADL_ADMIN list is used only as a mailing list. For more information, see
              “Defining security mechanisms” on page 56.

               An administrator on a machine is granted administrative privileges on that
               machine only; it does not grant administrative privileges on other machines. To be
              an administrator on all machines in the LoadLeveler cluster, either specify your
              user ID in the global configuration file with no entries in the local configuration
              file, or specify your user ID in every local configuration file that exists in the
              LoadLeveler cluster.

              For information about configuration file keyword syntax and other details, see
              Chapter 12, “Configuration file reference,” on page 263.


Defining a LoadLeveler cluster
                            It will be necessary to define the characteristics of the LoadLeveler cluster.

                            Table 11 lists the topics that discuss how you can define the characteristics of the
                            LoadLeveler cluster.
                            Table 11. Roadmap for defining LoadLeveler cluster characteristics
                            To learn about:                   Read the following:
                            Defining characteristics of       v “Choosing a scheduler”
                            specific LoadLeveler daemons
                                                              v “Setting negotiator characteristics and policies” on page
                                                                45
                                                              v “Specifying alternate central managers” on page 46
                            Defining other cluster            v “Defining network characteristics” on page 47
                            characteristics
                                                              v “Specifying file and directory locations” on page 47
                                                              v “Configuring recording activity and log files” on page
                                                                48
                                                              v “Setting up file system monitoring” on page 54
                            Correctly specifying              Chapter 12, “Configuration file reference,” on page 263
                            configuration file keywords
                         Working with daemons and          v “llctl - Control LoadLeveler daemons” on page 439
                         machines in a LoadLeveler         v “llinit - Initialize machines in the LoadLeveler cluster”
                         cluster                             on page 457



                 Choosing a scheduler
                            This topic discusses the types of schedulers available, which you may specify using
                            the configuration file keyword SCHEDULER_TYPE.

                            For information about the configuration file keyword syntax and other details, see
                            Chapter 12, “Configuration file reference,” on page 263.
|                           LL_DEFAULT
|                                 This scheduler runs serial jobs. It efficiently uses CPU time by scheduling
|                                 jobs on what otherwise would be idle nodes (and workstations). It does
|                                 not require that users set a wall clock limit. Also, this scheduler starts,
|                                 suspends, and resumes jobs based on workload.
|                           BACKFILL
|                                This scheduler runs both serial and parallel jobs. The objective of
|                                BACKFILL scheduling is to maximize the use of resources to achieve the
|                                highest system efficiency, while preventing potentially excessive delays in
|                                starting jobs with large resource requirements. These large jobs can run
|                                because the BACKFILL scheduler does not allow jobs with smaller resource
|                                requirements to continuously use up resources before the larger jobs can
|                                accumulate enough resources to run.
|                                    The BACKFILL scheduler supports:
|                                    v The scheduling of multiple tasks per node
|                                    v The scheduling of multiple user space tasks per adapter
|                                    v The preemption of jobs
|                                    v The use of reservations
|                                    v The scheduling of inbound and outbound data staging tasks



|                 v Scale-across scheduling that allows you to take advantage of
|                   underutilized resources in a multicluster installation
|                 These functions are not supported by the default LoadLeveler scheduler.
|                 For more information about the BACKFILL scheduler, see “Using the
|                 BACKFILL scheduler” on page 110.
          API     This keyword option allows you to enable an external scheduler, such as
                  the Extensible Argonne Scheduling sYstem (EASY). The API option is
                  intended for installations that want to create a scheduling algorithm for
                  parallel jobs based on site-specific requirements.
                  For more information about external schedulers, see “Using an external
                  scheduler” on page 115.
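
           For example, to select the BACKFILL scheduler, set the following keyword in the
           global configuration file:
           SCHEDULER_TYPE = BACKFILL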

    Setting negotiator characteristics and policies
          You may set the following negotiator characteristics and policies.

          For information about configuration file keyword syntax and other details, see
          Chapter 12, “Configuration file reference,” on page 263.
          v Prioritize the queue maintained by the negotiator
             Each job step submitted to LoadLeveler is assigned a system priority number,
             based on the evaluation of the SYSPRIO keyword expression in the
             configuration file of the central manager (a sample expression appears after this
             list). The LoadLeveler system priority number is assigned when the central
             manager adds the new job step to the queue of job steps eligible for dispatch.
             Once assigned, the system priority number for a job step is not changed, except
             under the following circumstances:
            – An administrator or user issues the llprio command to change the system
               priority of the job step.
            – The value set for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
               keyword is not zero.
            – An administrator uses the llmodify command with the -s option to alter the
               system priority of a job step.
            – A program with administrator credentials uses the ll_modify subroutine to
               alter the system priority of a job step.
            Job steps assigned higher SYSPRIO numbers are considered for dispatch before
            job steps with lower numbers.
            For related information, see the following topics:
            – “Controlling the central manager scheduling cycle” on page 73.
            – “Setting and changing the priority of a job” on page 230.
            – “llmodify - Change attributes of a submitted job step” on page 464.
            – “ll_modify subroutine” on page 677.
          v Prioritize the order of executing machines maintained by the negotiator
             Each executing machine is assigned a machine priority number, based on the
             evaluation of the MACHPRIO keyword expression in the configuration file of
             the central manager (a sample expression appears after this list). The
             LoadLeveler machine priority number is updated every
            time the central manager updates its machine data. Machines assigned higher
            MACHPRIO numbers are considered to run jobs before machines with lower
            numbers. For example, a machine with a MACHPRIO of 10 is considered to run
            a job before a machine with a MACHPRIO of 5. Similarly, a machine with a
            MACHPRIO of -2 would be considered to run a job before a machine with a
            MACHPRIO of -3.
            Note that the MACHPRIO keyword is valid only on the machine where the
            central manager is running. Using this keyword in a local configuration file has
            no effect.

When you use a MACHPRIO expression that is based on load average, the
                              machine may be temporarily ordered later in the list immediately after a job is
                              scheduled to that machine. This temporary drop in priority happens because the
                              negotiator adds a compensating factor to the startd machine’s load average
                              every time the negotiator assigns a job. For more information, see the
                              NEGOTIATOR_LOADAVG_INCREMENT keyword.
                            v Specify additional negotiator policies
                              This topic lists keywords that were not mentioned in the previous configuration
                              steps. Unless your installation has special requirements for any of these
                              keywords, you can use them with their default settings.
                              – NEGOTIATOR_INTERVAL
                              – NEGOTIATOR_CYCLE_DELAY
                              – NEGOTIATOR_CYCLE_TIME_LIMIT
                              – NEGOTIATOR_LOADAVG_INCREMENT
                              – NEGOTIATOR_PARALLEL_DEFER
                              – NEGOTIATOR_PARALLEL_HOLD
                              – NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
                              – NEGOTIATOR_REJECT_DEFER
                              – NEGOTIATOR_REMOVE_COMPLETED
                              – NEGOTIATOR_RESCAN_QUEUE
|                             – SCALE_ACROSS_SCHEDULING_TIMEOUT
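
           For example, the following expressions (illustrative only, not a recommended
           policy) give the highest SYSPRIO to the job steps that have been queued longest
           and the highest MACHPRIO to the machines with the lowest load average:
           SYSPRIO : 0 - (QDate)
           MACHPRIO : 0 - (LoadAvg)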

                 Specifying alternate central managers
                            In one of your machine stanzas specified in the administration file, you specified
                            that the machine would serve as the central manager.

                             Problems such as network communication failures, or software or hardware
                             failures, can make this central manager unusable. In such cases, the other
                             machines in the LoadLeveler cluster believe that the central manager machine
                             is no longer operating. To remedy this situation, you can assign one or
                             more alternate central managers in the machine stanza to take control.

                            The following machine stanza example defines the machine deep_blue as an
                            alternate central manager:
                            #
                            deep_blue: type=machine
                            central_manager = alt

                            If the primary central manager fails, the alternate central manager then becomes
                            the central manager. The alternate central manager is chosen based upon the order
                            in which its respective machine stanza appears in the administration file.

                            When an alternate becomes the central manager, jobs will not be lost, but it may
                            take a few minutes for all of the machines in the cluster to check in with the new
                            central manager. As a result, job status queries may be incorrect for a short time.

                            When you define alternate central managers, you should set the following
                            keywords in the configuration file:
                            v CENTRAL_MANAGER_HEARTBEAT_INTERVAL
                            v CENTRAL_MANAGER_TIMEOUT

                            In the following example, the alternate central manager will wait for 30 intervals,
                            where each interval is 45 seconds:




# Set a 45 second interval
      CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
      # Set the number of intervals to wait
      CENTRAL_MANAGER_TIMEOUT = 30

      For more information on central manager backup, refer to “What happens if the
      central manager isn’t operating?” on page 708. For information about configuration
      file keyword syntax and other details, see Chapter 12, “Configuration file
      reference,” on page 263.

Defining network characteristics
      A port number is an integer that specifies the port to use to connect to the
      specified daemon.

      You can define these port numbers in the configuration file or the /etc/services file
      or you can accept the defaults. LoadLeveler first looks in the configuration file for
      these port numbers. If LoadLeveler does not find the value in the configuration
      file, it looks in the /etc/services file. If the value is not found in this file, the default
      is used.

      See Appendix C, “LoadLeveler port usage,” on page 741 for more information.

Specifying file and directory locations
      The configuration file provided with LoadLeveler specifies default locations for all
      of the files and directories.

      You can modify their locations using the keywords shown in Table 12. Keep in
      mind that the LoadLeveler installation process installs files in these directories and
      these files may be periodically cleaned up. Therefore, you should not keep any
      files that do not belong to LoadLeveler in these directories.

       Managing distributed software systems is a primary concern for all system
       administrators. Allowing users to share file systems to obtain a single,
       network-wide image is one way to make managing LoadLeveler easier.
      Table 12. Default locations for all of the files and directories
      To specify the
      location of the:   Specify this keyword:
      Administration     ADMIN_FILE
      file
      Local              LOCAL_CONFIG
      configuration
      file
      Local directory    The following subdirectories reside in the local directory. It is possible that
                         the local directory and LoadLeveler’s home directory are the same.
                         v COMM
                         v EXECUTE
                         v LOG
                         v SPOOL and HISTORY

                         Tip: To maximize performance, you should keep the log, spool, and
                         execute directories in a local file system. Also, to measure the performance
                         of your network, consider using one of the available products, such as
                         Toolbox/6000.




                        Release            RELEASEDIR
                        directory
                                           The following subdirectories are created during installation and they
                                           reside in the release directory. You can change their locations.
                                           v BIN
                                           v LIB
                        Core dump          You may specify alternate directories to hold core dumps for the daemons
                        directory          and starter process:
                                           v MASTER_COREDUMP_DIR
                                           v NEGOTIATOR_COREDUMP_DIR
                                           v SCHEDD_COREDUMP_DIR
                                           v STARTD_COREDUMP_DIR
                                           v GSMONITOR_COREDUMP_DIR
                                           v KBDD_COREDUMP_DIR
                                           v STARTER_COREDUMP_DIR

                                           When specifying core dump directories, be sure that the access
                                           permissions are set so the LoadLeveler daemon or process can write to
                                           the core dump directory. The permissions set for path names specified in
                                           the keywords just mentioned must allow writing by both root and the
                                           LoadLeveler ID. The permissions set for the path name specified for the
                                           STARTER_COREDUMP_DIR keyword must allow writing by root, the
                                           LoadLeveler ID, and any user who can submit LoadLeveler jobs.

                                           The simplest way to be sure the access permissions are set correctly is to
                                           set them to the same permissions that are set for the /tmp directory.

                                           If a problem with access permissions prevents a LoadLeveler daemon or
                                           process from writing to a core dump directory, then a message will be
                                           written to the log, and the daemon or process will continue using the
                                           default /tmp directory for core files.
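
                         For example, the following entries keep the log, spool, and execute directories in
                         a local file system, as the tip in Table 12 recommends (the paths shown are
                         illustrative):
                         LOG     = /var/loadl/log
                         SPOOL   = /var/loadl/spool
                         EXECUTE = /var/loadl/execute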


                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

             Configuring recording activity and log files
                        The LoadLeveler daemons and processes keep log files according to the
                        specifications in the configuration file.

                        Administrators can also configure the LoadLeveler daemons to store additional
                        debugging messages in a circular buffer in memory. A number of keywords are
                        used to describe where LoadLeveler maintains the logs and how much information
                        is recorded in each log and buffer. These keywords, shown in Table 13 on page 49,
                        are repeated in similar form to specify the path name of the log file, its maximum
                        length, the size of the circular buffer, and the debug flags to be used for the log
                        and the buffer.

                        “Controlling the logging buffer” on page 50 describes how administrators can
                        configure LoadLeveler to buffer debugging messages.

                        “Controlling debugging output” on page 51 describes the events that can be
                        reported through logging controls.



“Saving log files” on page 53 describes the configuration keyword to use to save
                        logs for problem diagnosis.

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.
Table 13. Log control statements
Daemon/      Log File (required)     Max Length (required)                       Debug Control (required)
Process      (See note 1)            (See note 2)                                (See note 4 on page 50)
Master       MASTER_LOG = path       MAX_MASTER_LOG = bytes [buffer bytes]       MASTER_DEBUG = flags [buffer flags]
Schedd       SCHEDD_LOG = path       MAX_SCHEDD_LOG = bytes [buffer bytes]       SCHEDD_DEBUG = flags [buffer flags]
Startd       STARTD_LOG = path       MAX_STARTD_LOG = bytes [buffer bytes]       STARTD_DEBUG = flags [buffer flags]
Starter      STARTER_LOG = path      MAX_STARTER_LOG = bytes [buffer bytes]      STARTER_DEBUG = flags [buffer flags]
Negotiator   NEGOTIATOR_LOG = path   MAX_NEGOTIATOR_LOG = bytes [buffer bytes]   NEGOTIATOR_DEBUG = flags [buffer flags]
Kbdd         KBDD_LOG = path         MAX_KBDD_LOG = bytes [buffer bytes]         KBDD_DEBUG = flags [buffer flags]
GSmonitor    GSMONITOR_LOG = path    MAX_GSMONITOR_LOG = bytes [buffer bytes]    GSMONITOR_DEBUG = flags [buffer flags]


                        where:
                        buffer bytes
                                 Is the size of the circular buffer. The default value is 0, which indicates that
                                 the buffer is disabled. To prevent the daemon from running out of
                                 memory, this value should not be too large. Brackets must be used to
                                 specify buffer bytes.
                        buffer flags
                                  Indicates that messages with buffer flags in addition to messages with flags
                                  will be stored in the circular buffer in memory. The default value is blank,
                                  which indicates that the logging buffer is disabled because no additional
                                  debug flags were specified for buffering. Brackets must be used to specify
                                  buffer flags.

                        Note:
                                 1. When coding the path for the log files, it is not necessary that all
                                    LoadLeveler daemons keep their log files in the same directory, however,
                                    you will probably find it a convenient arrangement.
                                 2. There is a maximum length, in bytes, beyond which the various log files
                                    cannot grow. Each file is allowed to grow to the specified length and is
                                    then saved to an .old file. The .old files are overwritten each time the log
                                    is saved, thus the maximum space devoted to logging for any one
                                    program will be twice the maximum length of its log file. The default
                                    length is 64 KB. To obtain records over a longer period of time that do
                                    not get overwritten, you can use the SAVELOGS keyword in the local or
                                    global configuration files. See “Saving log files” on page 53 for more
                                    information on extended capturing of LoadLeveler logs.


You can also specify that the log file be started anew with every
                                   invocation of the daemon by setting the TRUNC statement to true as
                                   follows:
                                   v TRUNC_MASTER_LOG_ON_OPEN = true|false
                                   v TRUNC_STARTD_LOG_ON_OPEN = true|false
                                   v TRUNC_SCHEDD_LOG_ON_OPEN = true|false
                                   v TRUNC_KBDD_LOG_ON_OPEN = true|false
                                   v TRUNC_STARTER_LOG_ON_OPEN = true|false
                                   v TRUNC_NEGOTIATOR_LOG_ON_OPEN = true|false
                                   v TRUNC_GSMONITOR_LOG_ON_OPEN = true|false
                                3. LoadLeveler creates temporary log files used by the starter daemon.
                                   These files are used for synchronization purposes. When a job starts, a
                                   StarterLog.pid file is created. When the job ends, this file is appended to
                                   the StarterLog file.
                                 4. Normally, only those who are installing or debugging LoadLeveler will
                                    need to use the debug flags, described in “Controlling debugging
                                    output” on page 51. The default error logging, obtained by leaving the
                                    right side of the debug control statement null, will be sufficient for most
                                    installations.

                        Controlling the logging buffer
                        LoadLeveler allows a LoadLeveler daemon to store log messages in a buffer in
                        memory instead of writing the messages to a log file.

                        The administrator can force the messages in this buffer to be written to the log file,
                        when necessary, to diagnose a problem. The buffer is circular and once it is full,
                        older messages are discarded as new messages are logged. The llctl dumplogs
                        command is used to write the contents of the logging buffer to the appropriate log
                        file for the Master, Negotiator, Schedd, and Startd daemons.

                        Buffering will be disabled if either the buffer length is 0 or no additional debug
                        flags are specified for buffering.

                        See “Configuring recording activity and log files” on page 48 for log control
                        statement specifications. See TWS LoadLeveler: Diagnosis and Messages Guide for
                        additional information on TWS LoadLeveler log files.

                        Logging buffer example:

                        With the following configuration, the Schedd daemon will write only D_ALWAYS
                        and D_SCHEDD messages to the ${LOG}/SchedLog log file. The following
                        messages will be stored in the buffer:
                        v D_ALWAYS
                        v D_SCHEDD
                        v D_LOCKING
                        The maximum size of the Schedd log is 64 MB and the size of the logging buffer is
                        32 MB.
                        SCHEDD_LOG = ${LOG}/SchedLog
                        MAX_SCHEDD_LOG = 64000000 [32000000]
                        SCHEDD_DEBUG = D_SCHEDD [D_LOCKING]

                        To write the contents of the logging buffer to the SchedLog file on the local
                        machine, issue:
                        llctl dumplogs




To write the contents of the logging buffer to the SchedLog file on node1 in the
LoadLeveler cluster, issue:
llctl -h node1 dumplogs

To write the contents of the logging buffers to the SchedLog files on all machines,
issue:
llctl -g dumplogs

Note that the messages written from the logging buffer are enclosed by bracketing
messages and carry a prefix so that they can be identified easily.
=======================BUFFER BEGIN========================

BUFFER: message .....
BUFFER: message .....

=======================BUFFER END==========================

Controlling debugging output
You can control the level of debugging output logged by LoadLeveler programs.

The following flags are presented here for your information, though they are used
primarily by IBM personnel for debugging purposes:
D_ACCOUNT
        Logs accounting information about processes. If used, it may slow down
        the network.
D_ACCOUNT_DETAIL
        Logs detailed accounting information about processes. If used, it may slow
        down the network and increase the size of log files.
D_ADAPTER
        Logs messages related to adapters.
D_AFS
        Logs information related to AFS credentials.
D_CKPT
        Logs information related to checkpoint and restart.
D_DAEMON
        Logs information regarding basic daemon set up and operation, including
        information on the communication between daemons.
D_DBX
        Bypasses certain signal settings to permit debugging of the processes as
        they execute in certain critical regions.
D_EXPR
        Logs steps in parsing and evaluating control expressions.
D_FAIRSHARE
        Displays messages related to fair share scheduling in the daemon logs. In
        the global configuration file, D_FAIRSHARE can be added to
        SCHEDD_DEBUG and NEGOTIATOR_DEBUG.
D_FULLDEBUG
        Logs details about most actions performed by each daemon but doesn’t log
        as much activity as setting all the flags.
D_HIERARCHICAL
        Enables messages about problems with the transmission of hierarchical
        messages. A hierarchical message is sent from an originating node to lower
        ranked receiving nodes.
D_JOB
        Logs job requirements and preferences when making decisions regarding
        whether a particular job should run on a particular machine.

D_KERNEL
                              Activates diagnostics for errors involving the process tracking kernel
                              extension.
                        D_LOAD
                              Displays the load average on the startd machine.
                        D_LOCKING
                              Logs requests to acquire and release locks.
                        D_LXCPUAFNT
                              Logs messages related to Linux CPU affinity. This flag is only valid for the
                              startd daemon.
                        D_MACHINE
                              Logs machine control functions and variables when making decisions
                              regarding starting, suspending, resuming, and aborting remote jobs.
                        D_MUSTER
                              Logs information related to multicluster processing.
                        D_NEGOTIATE
                              Displays the process of looking for a job to run in the negotiator. It only
                              pertains to this daemon.
                        D_PCRED
                              Directs that extra debug information should be written to a file if the
                              setpcred() function call fails.
                        D_PROC
                              Logs information about jobs being started remotely such as the number of
                              bytes fetched and stored for each job.
                        D_QUEUE
                              Logs changes to the job queue.
                        D_REFCOUNT
                              Logs activity associated with reference counting of internal LoadLeveler
                              objects.
                        D_RESERVATION
                              Logs reservation information in the negotiator and Schedd daemon logs.
                              D_RESERVATION can be added to SCHEDD_DEBUG and
                              NEGOTIATOR_DEBUG.
                        D_RESOURCE
                              Logs messages about the management and consumption of resources.
                              These messages are recorded in the negotiator log.
                        D_SCHEDD
                              Displays how the Schedd works internally.
                        D_SDO
                              Displays messages detailing LoadLeveler objects being transmitted between
                              daemons and commands.
                        D_SECURITY
                              Logs information related to Cluster Security (CtSec) services identities.
                        D_SPOOL
                              Logs information related to the usage of databases in the LoadLeveler
                              spool directory.
                        D_STANZAS
                              Displays internal information about the parsing of the administration file.
                        D_STARTD
                              Displays how the startd works internally.
                        D_STARTER
                              Displays how the starter works internally.
                        D_STREAM
                              Displays messages detailing socket I/O.



52   TWS LoadLeveler: Using and Administering
D_SWITCH
      Logs entries related to switch activity and LoadLeveler Switch Table Object
      data.
D_THREAD
      Displays the ID of the thread producing the log message. The thread ID is
      displayed immediately following the date and time. This flag is useful for
      debugging threaded daemons.
D_XDR
      Logs information regarding External Data Representation (XDR)
      communication protocols.
        For example:
        SCHEDD_DEBUG = D_CKPT   D_XDR

        Causes the scheduler to log information about checkpointing user jobs and
        exchange xdr messages with other LoadLeveler daemons. These flags will
        primarily be of interest to LoadLeveler implementers and debuggers.

The LL_COMMAND_DEBUG environment variable can be set to a string of
debug flags the same way as the *_DEBUG configuration keywords are set.
Normally, LoadLeveler commands and APIs do not print debug messages, but
with this environment variable set, the requested classes of debugging messages
will be logged to stderr. For example:
LL_COMMAND_DEBUG="D_ALWAYS D_STREAM" llstatus

will cause the llstatus command to print out debug messages related to I/O to
stderr.

Saving log files
By default, LoadLeveler stores only the two most recent iterations of a daemon’s
log file (<daemon name>Log, and <daemon name>Log.old).

Occasionally, for problem diagnosis, users need to capture LoadLeveler logs
over an extended period. Users can specify that all log files be saved to a
particular directory by using the SAVELOGS keyword in a local or global
configuration file. Be aware that LoadLeveler does not provide any way to manage
and clean out all of those log files, so users must be sure to specify a directory in a
file system with enough space to accommodate them. This file system should be
separate from the one used for the LoadLeveler log, spool, and execute directories.
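
For example, a minimal configuration entry might look like this (the directory
path here is hypothetical):
SAVELOGS = /scratch/loadl/savelogs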

Each log file is represented by the name of the daemon that generated it, the exact
time the file was generated, and the name of the machine on which the daemon is
running. When you list the contents of the SAVELOGS directory, the list of log file
names looks like this:
NegotiatorLogNov02.16:10:39.123456.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:42.987654.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:46.564123.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:48.234345.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:51.123456.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:53.566987.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:19.622387.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:51.499823.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:10:30.876546.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:05.543677.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:26.688901.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:47.443556.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:12.712680.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:37.342156.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:05.697753.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:26.881234.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:47.231234.c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:12.125556.c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:37.961486.c163n10.ppd.pok.ibm.com

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

             Setting up file system monitoring
                         You can use the file system keywords to monitor the file system space or inodes
                         used by LoadLeveler.

                         These keywords let you monitor the file system space or inodes that
                         LoadLeveler uses for:
                        v Logs
                        v Saving executables
                        v Spool information
                        v History files

                         You can also use the file system keywords to take preventive action and avoid
                         problems caused by running out of file system space or inodes. You do this by
                         setting how frequently LoadLeveler checks the file system free space or inodes,
                         and by setting the upper and lower thresholds that trigger system responses to
                         the free space or inodes available. By setting a realistic span between the lower
                         and upper thresholds, you will avoid excessive system actions.

                        The file system monitoring keywords are:
                        v FS_INTERVAL
                        v FS_NOTIFY
                        v FS_SUSPEND
                        v FS_TERMINATE
                        v INODE_NOTIFY
                        v INODE_SUSPEND
                        v INODE_TERMINATE

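                         For example, a sketch of a monitoring setup, assuming FS_INTERVAL is
                         expressed in minutes and FS_NOTIFY takes a lower,upper threshold pair (the
                         values are illustrative; see Chapter 12 for the exact value syntax):
                         FS_INTERVAL = 5
                         FS_NOTIFY   = 10mb,20mb
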
                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

Defining LoadLeveler machine characteristics
                        You can use the following keywords to define the characteristics of machines in the
                        LoadLeveler cluster.

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.
                        v ARCH
                        v CLASS
                        v CUSTOM_METRIC
                        v CUSTOM_METRIC_COMMAND
                        v FEATURE
                        v GSMONITOR_RUNS_HERE
                        v MAX_STARTERS
                        v SCHEDD_RUNS_HERE
                        v SCHEDD_SUBMIT_AFFINITY
                        v STARTD_RUNS_HERE
                         v START_DAEMONS
                         v VM_IMAGE_ALGORITHM
                         v X_RUNS_HERE
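
                         For example, a minimal sketch that runs both a Schedd and a startd on a
                         machine and caps it at four concurrently running job steps (the values shown
                         are illustrative):
                         SCHEDD_RUNS_HERE = True
                         STARTD_RUNS_HERE = True
                         MAX_STARTERS     = 4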

Defining job classes that a LoadLeveler machine will accept
      There are a number of possible ways of defining job classes.

      The following examples illustrate possible ways of defining job classes.
      v Example 1
        This example specifies multiple classes:
        Class = No_Class(2)

        or
        Class = { "No_Class" "No_Class" }

        The machine will only run jobs that have either defaulted to or explicitly
        requested class No_Class. A maximum of two LoadLeveler jobs are permitted to
        run simultaneously on the machine if the MAX_STARTERS keyword is not
        specified. See “Specifying how many jobs a machine can run” for more
        information on MAX_STARTERS.
      v Example 2
        This example specifies multiple classes:
        Class = No_Class(1) Small(1) Medium(1) Large(1)

        or
        Class = { "No_Class" "Small" "Medium" "Large" }

        The machine will only run a maximum of four LoadLeveler jobs that have either
        defaulted to, or explicitly requested No_Class, Small, Medium, or Large class. A
        LoadLeveler job with class IO_bound, for example, would not be eligible to run
        here.
      v Example 3
        This example specifies multiple classes:
        Class = B(2) D(1)

        or
        Class = { "B" "B" "D" }

        The machine will run only LoadLeveler jobs that have explicitly requested class
        B or D. Up to three LoadLeveler jobs may run simultaneously: two of class B
        and one of class D. A LoadLeveler job with class No_Class, for example, would
        not be eligible to run here.

Specifying how many jobs a machine can run
      To specify how many jobs a machine can run, you need to take into consideration
      both the MAX_STARTERS keyword and the Class statement.

      This is described in more detail in “Defining LoadLeveler machine characteristics”
      on page 54.

      For example, if the configuration file contains these statements:

Class = A(1) B(2) C(1)
MAX_STARTERS = 2

                        then the machine can run a maximum of two LoadLeveler jobs simultaneously. The
                        possible combinations of LoadLeveler jobs are:
                        v A and B
                        v A and C
                        v B and B
                        v B and C
                        v Only A, or only B, or only C

                        If this keyword is specified together with a Class statement, the maximum number
                        of jobs that can be run is equal to the lower of the two numbers. For example, if:
                        MAX_STARTERS = 2
                        Class = class_a(1)

                        then the maximum number of job steps that can be run is one (the Class statement
                        defines one class).

                         If you specify the MAX_STARTERS keyword without specifying a Class statement, by
                        default one class still exists (called No_Class). Therefore, the maximum number of
                        jobs that can be run when you do not specify a Class statement is one.

                        Note: If the MAX_STARTERS keyword is not defined in either the global
                              configuration file or the local configuration file, the maximum number of
                              jobs that the machine can run is equal to the number of classes in the Class
                              statement.

Defining security mechanisms
                         LoadLeveler can be configured to control authentication and authorization of
                         LoadLeveler functions by using Cluster Security (CtSec) services, a subcomponent
                         of Reliable Scalable Cluster Technology (RSCT), which uses host-based
                         authentication (HBA) as its underlying security mechanism.

                        LoadLeveler permits only one security service to be configured at a time. You can
                        skip this topic if you do not plan to use this security feature or if you plan to use
                        the credential forwarding provided by the llgetdce and llsetdce program pair.
                        Refer to “Using the alternative program pair: llgetdce and llsetdce” on page 75 for
                        more information.

                        LoadLeveler for Linux does not support CtSec security.

                         LoadLeveler can also be enabled to interact with OpenSSL for secure multicluster
                         communications.

                         Table 14 on page 57 lists the topics that explain how to configure LoadLeveler
                         to secure its operations.

Table 14. Roadmap of configuration tasks for securing LoadLeveler operations

To learn about:                     Read the following:
Securing LoadLeveler operations     v “Configuring LoadLeveler to use cluster
using cluster security services       security services”
                                    v “Steps for enabling CtSec services” on page 58
                                    v “Limiting which security mechanisms
                                      LoadLeveler can use” on page 60
Enabling LoadLeveler to secure      “Steps for securing communications within a
multicluster communication with     LoadLeveler multicluster” on page 153
OpenSSL
Correctly specifying                Chapter 12, “Configuration file reference,” on
configuration file keywords         page 263



Configuring LoadLeveler to use cluster security services
       Cluster Security (CtSec) services allow a software component to authenticate
       and authorize the identities of its peers or clients.

      When configured to use CtSec services, LoadLeveler will:
      v Authenticate the identity of users and programs interacting with LoadLeveler.
      v Authorize users and programs to use LoadLeveler services. It prevents
        unauthorized users and programs from misusing resources or disrupting
        services.

      To use CtSec services, all nodes running LoadLeveler must first be configured as
      part of a cluster running Reliable Scalable Cluster Technology (RSCT). For details
      on CtSec services administration, see IBM Reliable Scalable Cluster Technology:
      Administration Guide, SA22-7889.

      CtSec services are designed to use multiple security mechanisms and each security
      mechanism must be configured for LoadLeveler. At the present time, directions are
      provided only for configuring the host-based authentication (HBA) security
      mechanism for LoadLeveler’s use. If CtSec is configured to use additional security
      mechanisms that are not configured for LoadLeveler’s use, then the LoadLeveler
      configuration file keyword SEC_IMPOSED_MECHS must be specified. This
      keyword is used to limit the security mechanisms that will be used by CtSec
      services to only those that are configured for use by LoadLeveler.

      Authorization is based on user identity. When CtSec services are enabled for
      LoadLeveler, user identity will differ depending on the authentication mechanism
       in use. A user’s identity in UNIX host-based authentication is the user’s network
       identity, which consists of the user name and host name, such as
       user_name@host.

      LoadLeveler uses CtSec services to authorize owners of jobs, administrators and
      LoadLeveler daemons to perform certain actions. CtSec services uses its own
      identity mapping file to map the clients’ network identity to a local identity when
      performing authorizations. A typical local identity is the user name without a
      hostname. The local identities of the LoadLeveler administrators must be added as
      members of the group specified by the keyword SEC_ADMIN_GROUP. The local
      identity of the LoadLeveler user name must be added as the sole member of the
      group specified by the keyword SEC_SERVICES_GROUP. The LoadLeveler
      Services and Administrative groups, those identified by the keywords
                         SEC_SERVICES_GROUP and SEC_ADMIN_GROUP, must be the same across all
                        nodes in the LoadLeveler cluster. To ensure consistency in performing tasks which
                        require owner, administrative or daemon privileges across all nodes in the
                        LoadLeveler cluster, user network identities must be mapped identically across all
                        nodes in the LoadLeveler cluster. If this is not the case, LoadLeveler authorizations
                        may fail.

                        Steps for enabling CtSec services
                        It is necessary to enable LoadLeveler to use CtSec services.

                        To enable LoadLeveler to use CtSec services, perform the following steps:
       1. Include, in the Trusted Host List, the host names of all hosts with which
          communications may take place. If LoadLeveler tries to communicate with a
          host that is not on the Trusted Host List, the following message occurs: The
          host identified in the credentials is not a trusted host on this
          system. Additionally, the
                           system administrator should ensure that public keys are manually exchanged
                           between all hosts in the LoadLeveler cluster. Refer to IBM Reliable Scalable
                           Cluster Technology: Administration Guide, SA22-7889 for information on setting
                           up Trusted Host Lists and manually transferring public keys.
                        2. Create user IDs. Each LoadLeveler administrator and the LoadLeveler user ID
                           need to be created, if they don’t already exist. You can do this through SMIT or
                           the mkuser command.
                        3. Ensure that the unix.map file contains the correct value for the service name
                           ctloadl which specifies the LoadLeveler user name. If you have configured
                           LoadLeveler to use loadl as the LoadLeveler user name, either by default or by
                           specifying loadl in the LoadLUserid keyword of the /etc/LoadL.cfg file, nothing
                           needs to be done. The default map file will contain the ctloadl service name
                           already assigned to loadl. If you have configured a different user name in the
                           LoadLUserid keyword of the /etc/LoadL.cfg file, you will need to make sure
                           that the /var/ct/cfg/unix.map file exists and that it assigns the same user name
                           to the ctloadl service name. If the /var/ct/cfg/unix.map file does not exist, create
                           one by copying the default map file /usr/sbin/rsct/cfg/unix.map. Do not modify
                           the default map file.
                           If the value of the LoadLUserid and the value associated with ctloadl are not
                           the same a security services error which indicates a UNIX identity mismatch
                           will occur.
                        4. Add entries to the global mapping file of each machine in the LoadLeveler
                           cluster to map network identities to local identities. This file is located at:
                           /var/ct/cfg/ctsec_map.global. If this file doesn’t yet exist, you should copy the
                           default global mapping file to this location—don’t modify the default mapping
                           file. The default global mapping file, which is shared among all CtSec services
                           exploiters, is located at /usr/sbin/rsct/cfg/ctsec_map.global. See IBM Reliable
                            Scalable Cluster Technology for AIX: Technical Reference, SA22-7890 for more
                           information on the mapping file.
                           When adding names to the global mapping file, enter more specific entries
                           ahead of the other, less specific entries. Remember that you must update the
                           global mapping file on each machine in the LoadLeveler cluster, and each
                           mapping file has to be updated with the security services identity of each
                           member of the administrator group, the services group, and the users.
                           Therefore, you would have entries like this:
                            unix:brad@mach1.pok.ibm.com=bradleyf
                            unix:brad@mach2.pok.ibm.com=bradleyf
                            unix:brad@mach3.pok.ibm.com=bradleyf
                            unix:marsha@mach2.pok.ibm.com=marshab
   unix:marsha@mach3.pok.ibm.com=marshab
   unix:loadl@mach1.pok.ibm.com=loadl
   unix:loadl@mach2.pok.ibm.com=loadl
   unix:loadl@mach3.pok.ibm.com=loadl

   However, if you’re sure your LoadLeveler cluster is secure, you could specify
   mapping for all machines this way:
   unix:brad@*=bradleyf
   unix:marsha@*=marshab
   unix:loadl@*=loadl

   This indicates that the UNIX network identity of the users brad, marsha and
   loadl will map to their respective security services identities on every machine
   in the cluster. Refer to IBM Reliable Scalable Cluster Technology for AIX: Technical
   Reference, SA22-7890 for a description of the syntax used in the
   ctsec_map.global file.
5. Create UNIX groups. The LoadLeveler administrator group and services group
   need to be created for every machine in the cluster and should contain the local
   identities of members. This can be done either by using SMIT or the mkgroup
   command.
   For example, to create the group lladmin, which lists the LoadLeveler
   administrators:
   mkgroup "users=sam,betty,loadl" lladmin

   To create the group llsvcs, which lists the identity under which LoadLeveler
   daemons run using the default ID of loadl:
   mkgroup users=loadl llsvcs

   Both groups must be created on each machine in the LoadLeveler cluster and
   must contain the same entries.
6. Add or update these keywords in the LoadLeveler configuration file:
   SEC_ENABLEMENT=CTSEC
   SEC_ADMIN_GROUP=name of lladmin group
   SEC_SERVICES_GROUP=group name that contains identities of LoadLeveler daemons

    The SEC_ENABLEMENT=CTSEC keyword indicates that the CtSec services
    mechanism should be used. SEC_ADMIN_GROUP points to the name of the
   UNIX group which contains the local identities of the LoadLeveler
   administrators. The SEC_SERVICES_GROUP keyword points to the name of
   the UNIX group which contains the local identity of the LoadLeveler daemons.
   All LoadLeveler daemons run as the LoadLeveler user ID. Refer to step 5 for
   discussion of the administrators and services groups.
7. Update the .rhosts file in each user’s home directory. This file is used to
   identify which UNIX identities can run LoadLeveler jobs on the local host
   machine. If the file does not exist in a user’s home directory, you must create it.
   The .rhosts file must contain entries which specify all host and user
   combinations allowed to submit jobs which will run as the local user, either
   explicitly or through the use of wildcards.
   Entries in the .rhosts file are specified this way:
   HostNameField [UserNameField]

   Refer to IBM AIX Files Reference, SC23-4168 for further details about the .rhosts
   file format.
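
   For example, hypothetical entries that allow user brad to run jobs submitted
   from two cluster hosts as the local user would look like this:
   mach1.pok.ibm.com brad
   mach2.pok.ibm.com brad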

Tips for configuring LoadLeveler to use CtSec services:

                        When using CtSec services for LoadLeveler, each machine in the LoadLeveler
                        cluster must be set up properly.

                         CtSec authenticates network identities through trust established between
                         individual machines in a cluster, which depends on each host's local
                         configuration. Because of this, it is possible for most of the cluster to run
                         correctly but for transactions from certain machines to experience
                         authentication or authorization problems.

                        If unexpected authentication or authorization problems occur in a LoadLeveler
                        cluster with CtSec enabled, check that the steps in “Steps for enabling CtSec
                        services” on page 58 were correctly followed for each machine in the LoadLeveler
                        cluster.

                        If any machine in a LoadLeveler cluster is improperly configured to run CtSec you
                        may see that:
                        v Users cannot perform user tasks (such as cancel) for jobs they submitted.
                           Either the machine the job was submitted from or the machine the user
                           operation was submitted from (or both) do not contain mapping files for the
                           user that specify the same security services identity. The user should attempt the
                           operation from the same machine the job was submitted from and record the
                           results. If the user still cannot perform a user task on a job they submitted, then
                           they should contact the LoadLeveler administrator who should review the steps
                           in “Steps for enabling CtSec services” on page 58.
                        v LoadLeveler daemons fail to communicate.
                            When LoadLeveler daemons communicate, they must first authenticate each
                            other. If the daemons cannot authenticate each other, a message indicating an
                            authentication failure is written to the daemon log. Ensure that the Trusted
                            Host List on all
                           LoadLeveler nodes contains the correct entries for all of the nodes in the
                           LoadLeveler cluster. Also, make sure that the LoadLeveler Services group on all
                           nodes of the LoadLeveler cluster contains the local identity for the LoadLeveler
                           user name. The ctsec_map.global must contain mapping rules to map the
                           LoadLeveler user name from every machine in the LoadLeveler cluster to the
                           local identity for the LoadLeveler user name. An example of what may happen
                           when daemons fail to communicate is that an alternate central manager may
                           take over while the primary central manager is still active. This can occur when
                           the alternate central manager does not trust the primary central manager.

                        Limiting which security mechanisms LoadLeveler can use
                        As more security mechanisms become available, they must be configured for
                        LoadLeveler’s use.

                        If there are security mechanisms configured for CtSec that are not configured for
                        LoadLeveler’s use, then the LoadLeveler configuration file keyword
                        SEC_IMPOSED_MECHS must specify the mechanisms configured for
                        LoadLeveler.

Defining usage policies for consumable resources
                        The LoadLeveler scheduler can schedule jobs based on the availability of
                        consumable resources.

                        You can use the following keywords to configure consumable resources:
                        v ENFORCE_RESOURCE_MEMORY

                         v ENFORCE_RESOURCE_POLICY
                         v ENFORCE_RESOURCE_SUBMISSION
                         v ENFORCE_RESOURCE_USAGE
                         v FLOATING_RESOURCES
                         v RESOURCES
                         v SCHEDULE_BY_RESOURCES

                  For information about configuration file keyword syntax and other details, see
                  Chapter 12, “Configuration file reference,” on page 263.
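
                   For example, a sketch that makes a floating software license a consumable
                   resource and tells the scheduler to consider it when scheduling jobs (the
                   resource name and count are illustrative):
                   FLOATING_RESOURCES    = spice2g6(10)
                   SCHEDULE_BY_RESOURCES = spice2g6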

    Enabling support for bulk data transfer and rCxt blocks
                  On systems with device drivers and network adapters that support remote
                  direct-memory access (RDMA), LoadLeveler allows bulk data transfer for jobs that
                  use either the Internet or user space communication protocol mode.

                  For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or
                  control the use of bulk transfer. For user space jobs that request bulk transfer,
                  however, LoadLeveler creates a consumable RDMA resource, and limits RDMA
                  resources to only four for a single machine with Switch Network Interface for HPS
                  network adapters. There is no limit on RDMA resources for machines with
                  InfiniBand network adapters.

                  You do not need to perform specific configuration or job-definition tasks to enable
                  bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler
                  cannot affect whether IP communication uses bulk transfer; the implementation of
                  IP where the job runs determines whether bulk transfer is supported.

                  To enable user space jobs to use bulk data transfer, you must update the
                  LoadLeveler configuration file to include the value RDMA in the
                  SCHEDULE_BY_RESOURCES list for machines with Switch Network Interface for
                  HPS network adapters.

                  Example:
                  SCHEDULE_BY_RESOURCES = RDMA others

                  For additional information about using bulk data transfer and job-definition
                  requirements, see “Using bulk data transfer” on page 188.

    Gathering job accounting data
                  Your organization may have a policy of charging users or groups of users for the
                  amount of resources that their jobs consume.

                  You can do this using LoadLeveler’s accounting feature. Using this feature, you can
                  produce accounting reports that contain job resource information for completed
|                 serial and parallel job steps. You can also view job resource information on jobs
                  that are continuing to run.

                  The accounting record for a job step will contain separate sets of resource usage
                  data for each time a job step is dispatched to run. For example, the accounting
                  record for a job step that is vacated and then started again will contain two sets of
                   resource usage data. The first set covers the time period from when the job step
                   was initially dispatched until the job step was vacated. The second set covers
                   the time period from when the job step was dispatched after the vacate until
                   the job step completed.

The job step’s accounting data that is provided in the llsummary short listing and
                            in the user mail will contain only one set of resource usage data. That data will be
                            from the last time the job step was dispatched to run. For example, the mail
                            message for job step completion for a job step that is checkpointed with the hold
                            (-h) option and then restarted, will contain the set of resource usage data only for
                            the dispatch that restarted the job from the checkpoint. To obtain the resource
                            usage data for the entire job step, use the detailed llsummary command or
                            accounting API.

                            The following keywords allow you to control accounting functions:
                            v ACCT
                            v ACCT_VALIDATION
                            v GLOBAL_HISTORY
                            v HISTORY_PERMISSION
                            v JOB_ACCT_Q_POLICY
                            v JOB_LIMIT_POLICY
                            For example, the following section of the configuration file specifies that the
                            accounting function is turned on. It also identifies the default module used to
                            perform account validation and the directory containing the global history files:
                            ACCT                    = A_ON A_VALIDATE
                            ACCT_VALIDATION         = $(BIN)/llacctval
                            GLOBAL_HISTORY          = $(SPOOL)

                            Table 15 lists the topics related to configuring, gathering and using job accounting
                            data.
                             Table 15. Roadmap of tasks for gathering job accounting data

                             To learn about:               Read the following:
                             Configuring LoadLeveler to    v “Collecting job resource data on serial and
                             gather job accounting data      parallel jobs”
                                                           v “Collecting job resource data based on
                                                             machines” on page 64
                                                           v “Collecting job resource data based on
                                                             events” on page 64
                                                           v “Collecting job resource information based
                                                             on user accounts” on page 65
                                                           v “Collecting accounting data for reservations”
                                                             on page 63
                                                           v “Collecting the accounting information and
                                                             storing it into files” on page 66
                                                           v “64-bit support for accounting functions” on
                                                             page 67
                                                           v “Example: Setting up job accounting files” on
                                                             page 67
                             Managing accounting data      v “Producing accounting reports” on page 66
                                                           v “Correlating AIX and LoadLeveler accounting
                                                             records” on page 66
                                                           v “llacctmrg - Collect machine history files” on
                                                             page 413
                                                           v “llsummary - Return job resource information
                                                             for accounting” on page 535
                             Correctly specifying          Chapter 12, “Configuration file reference,” on
                             configuration file keywords   page 263



                 Collecting job resource data on serial and parallel jobs
|                           Information on completed serial and parallel job steps is gathered using the UNIX
|                           wait3 system call.

Information on non-completed serial and parallel jobs is gathered in a
          platform-dependent manner by examining data from the UNIX process.

|         Accounting information on a completed serial job step is determined by
|         accumulating resources consumed by that job on the machines that ran the job.
|         Similarly, accounting information on completed parallel job steps is gathered by
|         accumulating resources used on all of the nodes that ran the job step.

          You can also view resource consumption information on serial and parallel jobs
          that are still running by specifying the -x option of the llq command. To enable llq
          -x, specify the following keywords in the configuration file:
          v ACCT = A_ON A_DETAIL
          v JOB_ACCT_Q_POLICY = number
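
           With these keywords set, you can then view extended resource consumption
           data for your running jobs, for example:
           llq -x

           The output can be restricted to particular job steps; see “llq - Query job
           status” on page 479 for the full option syntax.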

|   Collecting accounting information for recurring jobs
|         For recurring jobs, accounting records are written as each occurrence of each step
|         of the job completes. The reservation ID field in the accounting record can be used
|         to distinguish one occurrence from another.

    Collecting accounting data for reservations
          LoadLeveler can collect accounting data for reservations, which are set periods of
          time during which node resources are reserved for the use of particular users or
          groups.

          To enable recording of reservation information, specify the following keywords in
          the configuration file:
          v To turn on accounting for reservations, add the A_RES flag to the ACCT
            keyword.
          v To specify a file other than the default history file to contain the data, use the
            RESERVATION_HISTORY keyword.
          See Chapter 12, “Configuration file reference,” on page 263 for details about the
          ACCT and RESERVATION_HISTORY keywords.

          When these keyword values are set and a reservation ends or is canceled,
          LoadLeveler records the following information:
          v The reservation ID
          v The time at which the reservation was created
          v The user ID of the reservation owner
          v The name of the owning group
          v Requested and actual start times
          v Requested and actual duration
          v Actual time at which the reservation ended or was canceled
          v Whether the reservation was created with the SHARED or REMOVE_ON_IDLE options
          v A list of users and a list of groups that were authorized to use the reservation
          v The number of reserved nodes
          v The names of reserved nodes

          This reservation information is appended in a single line to the reservation history
          file for the reservation. The format of reservation history data is:
          Reservation ID!Reservation Creation Time!Owner!Owning Group!Start Time! 
           Actual Start Time!Duration!Actual Duration!Actual End Time!SHARED(yes|no)! 
          REMOVE_ON_IDLE(yes|no)!Users!Groups!Number of Nodes!Nodes!BG C-nodes! 
           BG Connection!BG Shape!Number of BG BPs!BG BPs

          In reservation history data:
v The unit of measure for start times and end times is the number of seconds since
                              January 1, 1970.
                            v The unit of time for durations is seconds.

|                           Note: As each occurrence of a recurring reservation completes, an accounting
|                                 record is appended to the reservation history file. The format of the record is
|                                 identical to that of a one time reservation. In the record, the Reservation ID
|                                 includes the occurrence ID of the completed reservation.

|                                   When you cancel the entire recurring reservation (as opposed to only one
|                                   occurrence being canceled), one additional accounting record is written. This
|                                   record is based on the state of the reservation:
|                                   v If an occurrence is ACTIVE, then the end time and duration of that
|                                      occurrence is set and an accounting record written.
|                                   v If there are no ACTIVE occurrences, then an accounting record will
|                                      be written for the next scheduled occurrence. This is similar to the
|                                      accounting record that is written when you cancel a one time reservation
|                                      in the WAITING state.

                            The following is an example of a reservation history file entry:
                            bgldd1.rchland.ibm.com.68.r!1150242970!ezhong!group1!1150243200!1150243200! 
                             300!300!1150243500!no!no!yang!fvt,dev!1!bgldd1!0!!!0!
                            bgldd1.rchland.ibm.com.54.r!1150143472!ezhong!No_Group!1153612800!0!60!0! 
                             1150243839!no!no!!!0!32!MESH!0x0x0!1!R010(J115)
                            bgldd1.rchland.ibm.com.70.r!1150244654!ezhong!No_Group!1150244760!1150244760! 
                             60!60!1150244820!yes!yes!user1,user2!group1,group2!0!512!MESH!1x1x1!1!R010

|                           To collect the reservation information stored in the history file, use the llacctmrg
|                           command with the -R option. For llacctmrg command syntax, see “llacctmrg -
|                           Collect machine history files” on page 413.

                            To format reservation history data contained in a file, use the sample script
                            llreshist.pl in directory ${RELEASEDIR}/samples/llres/.

                 Collecting job resource data based on machines
                            LoadLeveler can collect job resource usage information for every machine on
                            which a job may run.

                            A job may run on more than one machine because it is a parallel job or because the
                            job is vacated from one machine and rescheduled to another machine.

                            To enable recording of resources by machine, you need to specify ACCT = A_ON
                            A_DETAIL in the configuration file.

                             The machine’s speed is part of the data collected. With this information, an
                             installation can develop a chargeback program that charges more or less for
                             resources consumed by a job on different machines. For more information on a
                            machine’s speed, refer to the machine stanza information. See “Defining machines”
                            on page 84.

                 Collecting job resource data based on events
                            In addition to collecting job resource information based upon machines used, you
                            can gather this information based upon an event or time that you specify.

For example, you may want to collect accounting information at the end of every
      work shift or at the end of every week or month. To collect accounting information
      on all machines in this manner, use the llctl command with the capture parameter:
      llctl -g capture eventname

      eventname is any string of continuous characters (no white space is allowed) that
      defines the event about which you are collecting accounting data. For example, if
      you were collecting accounting data on the graveyard work shift, your command
      could be:
      llctl -g capture graveyard

      This command allows you to obtain a snapshot of the resources consumed by
      active jobs up to and including the moment when you issued the command. If you
      want to capture this type of information on a regular basis, you can set up a
      crontab entry to invoke this command regularly. For example:
      # sample crontab for accounting
      # shift crontab 94/8/5
      #
      # Set up three shifts, first, second, and graveyard shift.
      # Crontab entries indicate the end of shift.
      #
      #M H d m day command
      #
      00 08 * * * /u/loadl/bin/llctl -g capture graveyard
      00 16 * * * /u/loadl/bin/llctl -g capture first
      00 00 * * * /u/loadl/bin/llctl -g capture second

      For more information on the llctl command, refer to “llctl - Control LoadLeveler
      daemons” on page 439. For more information on the collection of accounting
      records, see “llq - Query job status” on page 479.

Collecting job resource information based on user accounts
      If your installation is interested in keeping track of resources used on an account
      basis, you can require all users to specify an account number in their job command
      files.

      They can specify this account number with the account_no keyword which is
      explained in detail in “Job command file keyword descriptions” on page 359.
      Interactive POE jobs can specify an account number using the
      LOADL_ACCOUNT_NO environment variable.

      LoadLeveler validates this account number by comparing it against a list of
      account numbers specified for the user in the user stanza in the administration file.

      Account validation is under the control of the ACCT keyword in the configuration
      file. The routine that performs the validation is called llacctval. You can supply
      your own validation routine by specifying the ACCT_VALIDATION keyword in
      the configuration file. The following are passed as character string arguments to
      the validation routine:
      v User name
      v User’s login group name
      v Account number specified on the Job
      v Blank-separated list of account numbers obtained from the user’s stanza in the
         administration file.
      The account validation routine must exit with a return code of zero if the
      validation succeeds. If it fails, the return code is a nonzero number.
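
       As an illustration, here is a minimal sketch of a custom validation routine
       written as a shell script. It assumes the four arguments arrive in the order
       listed above, with the account number list passed as a single blank-separated
       argument; it is not the shipped llacctval module:
       #!/bin/sh
       # $1 = user name, $2 = login group name, $3 = account number from the job,
       # $4 = blank-separated list of account numbers from the user stanza
       for acct in $4; do
           if [ "$acct" = "$3" ]; then
               exit 0    # validation succeeded
           fi
       done
       exit 1            # validation failed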

Collecting the accounting information and storing it into files
                        LoadLeveler stores the accounting information that it collects in a file called history
                        in the spool directory of the machine that initially scheduled this job, the Schedd
                        machine. Data on parallel jobs are also stored in the history files.

                        Resource information collected on the LoadLeveler job is constrained by the
                        capabilities of the wait3 system call. Information for processes which fork child
                        processes will include data for those child processes as long as the parent process
                        waits for the child process to terminate. Complete data may not be collected for
                        jobs which are not composed of simple parent/child processes. For example, if you
                        have a LoadLeveler job which invokes an rsh command to execute a function on
                        another machine, the resources consumed on the other machine will not be
                        collected as part of the LoadLeveler accounting data.

                        LoadLeveler accounting uses the following types of files:
                        v The local history file which is local to each Schedd machine is where job
                          resource information is first recorded. These files are usually named history and
                          are located in the spool directory of each Schedd machine, but you may specify
                          an alternate name with the HISTORY keyword in either the global or local
                          configuration file.
                        v The global history file is a combination of the history files from some or all of
                          the machines in the LoadLeveler cluster merged together. The command
                          llacctmrg is used to collect files together into a global file. As the files are
                          collected from each machine, the local history file for that machine is reset to
                          contain no data. The file is named globalhist.YYYYMMDDHHmm. You may
                          specify the directory in which to place the file when you invoke the llacctmrg
                          command or you can specify the directory with the GLOBAL_HISTORY
                          keyword in the configuration file. The default value set up in the sample
                          configuration file is the local spool directory.

             Producing accounting reports
                        You can produce three types of reports using either the local or global history file.

                         These reports are called the short, long, and extended versions. As their names
                         imply, the short version of the report is a brief listing of the resources used by
                         LoadLeveler jobs. The long version provides comprehensive detail with
                         summarized resource usage, and the extended version provides comprehensive
                         detail with detailed resource usage.

                        If you do not specify a report type, you will receive the default short version. The
                        short report displays the number of jobs along with the total CPU usage according
                        to user, class, group, and account number. The extended version of the report
                        displays all of the data collected for every job.
                        v For examples of the short and extended versions of the report, see “llsummary -
                           Return job resource information for accounting” on page 535.
                        v For information on the accounting APIs, refer to Chapter 17, “Application
                           programming interfaces (APIs),” on page 541.

             Correlating AIX and LoadLeveler accounting records
                        For jobs running on AIX systems, you can use a job accounting key to correlate
                        AIX accounting records with LoadLeveler accounting records.

                        The job accounting key uniquely identifies each job step. LoadLeveler derives this
                        key from the job key and the date and time at which the job entered the queue
       (see the QDate variable description). The key is associated with the starter process
      for the job step and any of its child processes.

      For checkpointed jobs, LoadLeveler does not change the job accounting key,
      regardless of how it restarts the job step. Jobs restarted from a checkpoint file or
      through a new job step retain the job accounting key for the original job step.

      To access the job accounting key for a job step, you can use the following
      interfaces:
      v The llsummary command, requesting the long version of the report. For details
         about using this command, see “llsummary - Return job resource information for
         accounting” on page 535.
      v The GetHistory subroutine. For details about using this subroutine, see
         “GetHistory subroutine” on page 545.
      v The ll_get_data subroutine, through the LL_StepAcctKey specification. For
         details about using this subroutine, see “ll_get_data subroutine” on page 570.

      For information about AIX accounting records, see the system accounting topic in
      AIX System Management Guide: Operating System and Devices.

64-bit support for accounting functions
      LoadLeveler 64-bit support for accounting functions includes several features.

      LoadLeveler 64-bit support for accounting functions includes:
      v Statistics of jobs such as usage, limits, consumable resources, and other 64-bit
        integer data are preserved in the history file as rusage64, rlimit64 structures and
        as data items of type int64_t.
      v The LL_job_step structure defined in llapi.h allows access to the 64-bit data
        items either as data of type int64_t or as data of type int32_t. In the latter case,
        the returned values may be truncated.
      v The llsummary command displays 64-bit information where appropriate.
      v The data access API supports both 64-bit and 32-bit access to accounting and
        usage information in a history file. See “Examples of using the data access API”
        on page 633 for an example of how to use the ll_get_data subroutine to access
        information stored in a LoadLeveler history file.

Example: Setting up job accounting files
      You can perform all of the steps included in this sample procedure or just the ones
      that apply to your situation.

      The sample procedure shown in Table 16 walks you through the process of
      collecting account data.
      1. Edit the configuration file according to the following table:
      Table 16. Collecting account data - modifying the configuration file
      Edit this keyword:        To:
      ACCT                      Turn accounting and account validation on and off and specify
                                detailed accounting.
      ACCT_VALIDATION           Specify the account validation routine.
      GLOBAL_HISTORY            Specify a directory in which to place the global history files.


2. Specify account numbers and set up account validation by performing the
                           following steps:
                           a. Specify a list of account numbers a user may use when submitting jobs, by
                               using the account keyword in the user stanza in the administration file.
                           b. Instruct users to associate an account number with their job, by using the
                               account_no keyword in the job command file.
                           c. Specify the ACCT_VALIDATION keyword in the configuration file that
                               identifies the module that will be called to perform account validation. The
                               default module is called llacctval. You can replace this module with your
                               installation’s own accounting routine by specifying a new module with this
                               keyword.
                        3. Specify machines and their weights by using the speed keyword in a machine’s
                           machine stanza in the administration file.
                           Also, if you have in your cluster machines of differing speeds and you want
                           LoadLeveler accounting information to be normalized for these differences,
                           specify cpu_speed_scale=true in each machine’s respective machine stanza.
                           For example, suppose you have a cluster of two machines, called A and B,
                           where Machine B is three times as fast as Machine A. Machine A has
                           speed=1.0, and Machine B has speed=3.0. Suppose a job runs for 12 CPU
                           seconds on Machine A. The same job runs for 4 CPU seconds on Machine B.
                           When you specify cpu_speed_scale=true, the accounting information collected
                           on Machine B for that job shows the normalized value of 12 CPU seconds
                           rather than the actual 4 CPU seconds.
                        4. Merge multiple files collected from each machine into one file, using the
                           llacctmrg command.
                        5. Report job information on all the jobs in the history file, using the llsummary
                           command.
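
                            For example, steps 4 and 5 might look like this (the directory and file
                            names are hypothetical):
                            llacctmrg /u/loadl/acct
                            llsummary /u/loadl/acct/globalhist.200811151630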

Managing job status through control expressions
                        You can control running jobs by using five control functions as Boolean expressions
                        in the configuration file.

                        These functions are useful primarily for serial jobs. You define the expressions,
                        using normal C conventions, with the following functions:
                        v START
                        v SUSPEND
                        v CONTINUE
                        v VACATE
                        v KILL

                        The expressions are evaluated for each job running on a machine using both the
                        job and machine attributes. Some jobs running on a machine may be suspended
                        while others are allowed to continue.

                         The START expression is evaluated twice: once to determine whether the machine
                         can accept jobs at all, and a second time to determine whether the specific job
                         can run on the machine. The other expressions are evaluated after the jobs have
                         been dispatched and, in some cases, are already running.

                         When evaluating the START expression to determine whether the machine can
                         accept jobs, Class != "Z" evaluates to true only if Z is not in the class
                         definition. This means that if two different classes are defined on a machine,
                         Class != "Z" (where Z is one of the defined classes) always evaluates to false
                         when specified in the START expression, and the machine therefore will not be
                         considered to start jobs.

      Typically, machine load average, keyboard activity, time intervals, and job class are
      used within these various expressions to dynamically control job execution.
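
       For example, a workstation owner might allow jobs to start only when the
       machine is lightly loaded or outside working hours, and suspend them while the
       keyboard is in use. A minimal sketch, assuming the LoadAvg, KeyboardIdle, and
       tm_hour variables and thresholds appropriate to your site:

          START:    (LoadAvg <= 0.5) || (tm_hour >= 18) || (tm_hour < 8)
          SUSPEND:  (KeyboardIdle < 60) && (tm_hour >= 8) && (tm_hour < 18)
          CONTINUE: KeyboardIdle > 300
          VACATE:   F
          KILL:     F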

      For additional information about:
      v Time-related variables that you may use for this keyword, see “Variables to use
        for setting times” on page 320.
      v Coding these control expressions in the configuration file, see Chapter 12,
        “Configuration file reference,” on page 263.

How control expressions affect jobs
      After LoadLeveler selects a job for execution, the job can be in any of several
      states.

      Figure 10 on page 70 shows how the control expressions can affect the state a job is
      in. The rectangles represent job or daemon states (Idle, Completed, Running,
      Suspended, and Vacating) and the diamonds represent the control expressions
      (Start, Suspend, Continue, Vacate, and Kill).




                        Figure 10. How control expressions affect jobs (flowchart showing the Idle, Running,
                        Suspended, Vacating, and Completed states, with the Start, Suspend, Continue, Vacate,
                        and Kill decision points between them)

                        Criteria used to determine when a LoadLeveler job will enter Start, Suspend,
                        Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files
                        and they can be different for each machine in the cluster. They can be modified to
                        meet local requirements.

Tracking job processes
                        When a job terminates, its orphaned processes may continue to consume or hold
                        resources, thereby degrading system performance, or causing jobs to hang or fail.

                         Process tracking allows LoadLeveler to cancel any processes (throughout the
                         entire cluster) left behind when a job terminates. Process tracking is required
                         for preemption by the suspend method when running either the BACKFILL or API
                         schedulers. Process tracking is optional in all other cases.




When process tracking is enabled, all child processes are terminated when the
              main process terminates. This will include any background or orphaned processes
              started in the prolog, epilog, user prolog, and user epilog.

              Process tracking on LoadLeveler for Linux is supported only on RHEL 5 and SLES
              10 systems.

              There are two keywords used in specifying process tracking:
              PROCESS_TRACKING
                 To activate process tracking, set PROCESS_TRACKING=TRUE in the
                 LoadLeveler global configuration file. By default, PROCESS_TRACKING is
                 set to FALSE.
              PROCESS_TRACKING_EXTENSION
                 On AIX, this keyword specifies the path to the loadable kernel module
                 LoadL_pt_ke in the local or global configuration file. If the
                 PROCESS_TRACKING_EXTENSION keyword is not supplied, then
                 LoadLeveler will search the $HOME/bin default directory.
                  On Linux, this keyword specifies the path to the loadable kernel module
                  proctrk.ko in the local or global configuration file. The proctrk.ko kernel
                  module is shipped as source code and must be built and installed on all
                  machines where process tracking is required. See the TWS LoadLeveler:
                  Installation Guide for additional information about which directory to specify
                  when using the PROCESS_TRACKING_EXTENSION configuration keyword.
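
               For example, on an AIX system the two keywords might be set as follows
               (a sketch; the directory containing the kernel extension is illustrative):

                  PROCESS_TRACKING           = TRUE
                  PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin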

               The process tracking kernel extension is not unloaded when the startd daemon
               terminates. Therefore, if a mismatch between the version of the loaded kernel
               extension and the version of the installed kernel extension is found when the
               startd starts up, the daemon will exit. In this case, a reboot of the node is
               needed to unload the currently loaded kernel extension. If you install a new
               version of LoadLeveler that contains a new version of the kernel extension,
               you may need to reboot the node.

              For information about configuration file keyword syntax and other details, see
              Chapter 12, “Configuration file reference,” on page 263.

Querying multiple LoadLeveler clusters
              This topic applies only to those installations having more than one LoadLeveler
              cluster, where the separate clusters have not been organized into a multicluster
              environment.

              To organize separate LoadLeveler clusters into a multicluster environment, see
              “LoadLeveler multicluster support” on page 148.

              You can query, submit, or cancel jobs in multiple LoadLeveler clusters by setting
              up a master configuration file for each cluster and using the LOADL_CONFIG
              environment variable to specify the name of the master configuration file that the
              LoadLeveler commands must use. The master configuration file must be located in
              the /etc directory and the file name must have a format of base_name.cfg where
              base_name is a user defined identifier for the cluster.

               The default name for the master configuration file is /etc/LoadL.cfg. The format
               for the LOADL_CONFIG environment variable is LOADL_CONFIG=/etc/base_name.cfg
               or LOADL_CONFIG=base_name. When you use the form LOADL_CONFIG=base_name,
               the prefix /etc and suffix .cfg are appended to the base_name.

                        The following example explains how you can set up a machine to query multiple
                        clusters:

                         You can configure /etc/LoadL.cfg to point to the configuration files for the
                         “default” cluster, and you can configure /etc/othercluster.cfg to point to the
                         configuration files of another cluster that the user can select.

                        For example, you can enter the following query command:
                        $ llq

                         The llq command uses the configuration from /etc/LoadL.cfg and queries job
                         information from the “default” cluster.

                        To send a query to the cluster defined in the configuration file of
                        /etc/othercluster.cfg, enter:
                        $ env LOADL_CONFIG=othercluster llq

                         Note that the machine from which you issue the llq command is considered a
                         submit-only machine by the other cluster.

Handling switch-table errors
                        Configuration file keywords can be used to control how LoadLeveler responds to
                        switch-table errors.

                        You may use the following configuration file keywords to control how LoadLeveler
                        responds to switch-table errors:
                        v ACTION_ON_SWITCH_TABLE_ERROR
                        v DRAIN_ON_SWITCH_TABLE_ERROR
                        v RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
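
       For example, to drain a node's startd when a switch-table error occurs and to
       resume it automatically when the error clears, you might set the following
       (a sketch; the values shown are illustrative, so check the keyword
       descriptions for the supported settings):

          DRAIN_ON_SWITCH_TABLE_ERROR        = true
          RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true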

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

Providing additional job-processing controls through installation exits
                        LoadLeveler allows administrators to further configure the environment through
                        installation exits.

                        Table 17 lists these additional job-processing controls.
                         Table 17. Roadmap of administrator tasks accomplished through installation exits
                         To learn about:                           Read the following:
                         Writing a program to control when jobs    “Controlling the central manager scheduling
                         are scheduled to run                      cycle” on page 73
                         Writing a pair of programs to override    “Handling DCE security credentials” on page 74
                         the default LoadLeveler DCE
                         authentication method
                         Writing a program to refresh an AFS       “Handling an AFS token” on page 75
                         token when a job starts
                         Writing a program to check or modify      “Filtering a job script” on page 76
                         job requests when they are submitted
                         Writing programs to run before and        “Writing prolog and epilog programs” on page 77
                         after job requests
                         Overriding the LoadLeveler default        “Using your own mail program” on page 81
                         mail notification method
                         Defining a cluster metric to determine    See the CLUSTER_METRIC configuration
                         where a remote job is distributed         keyword description in Chapter 12, “Configuration
                                                                   file reference,” on page 263.
                         Defining a cluster user mapper for a      See the CLUSTER_USER_MAPPER configuration
                         multicluster environment                  keyword description in Chapter 12, “Configuration
                                                                   file reference,” on page 263.
                         Correctly specifying configuration file   Chapter 12, “Configuration file reference,” on page
                         keywords                                  263



Controlling the central manager scheduling cycle
      To determine when to run the LoadLeveler scheduling algorithm, the central
      manager uses the values set in the configuration file for the
      NEGOTIATOR_INTERVAL and the NEGOTIATOR_CYCLE_DELAY keywords.

      The central manager will run the scheduling algorithm every
      NEGOTIATOR_INTERVAL seconds, unless some event takes place such as the
      completion of a job or the addition of a machine to the cluster. In such cases, the
      scheduling algorithm is run immediately. When NEGOTIATOR_CYCLE_DELAY is
      set, a minimum of NEGOTIATOR_CYCLE_DELAY seconds will pass between the
      central manager’s scheduling attempts, regardless of what other events might take
      place. When the NEGOTIATOR_INTERVAL is set to zero, the central manager
      will not run the scheduling algorithm until instructed to do so by an authorized
      process. This setting enables your program to control the central manager’s
      scheduling activity through one of the following:
      v The llrunscheduler command.
      v The ll_run_scheduler subroutine.
      Both the command and the subroutine instruct the central manager to run the
      scheduling algorithm.
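
       For example (values are illustrative):

          # Run the scheduling algorithm every 30 seconds, but never more often
          # than every 5 seconds:
          NEGOTIATOR_INTERVAL    = 30
          NEGOTIATOR_CYCLE_DELAY = 5

          # Alternatively, disable automatic scheduling so that an authorized
          # program drives the cycle through llrunscheduler or ll_run_scheduler:
          # NEGOTIATOR_INTERVAL = 0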

      You might choose to use this setting if, for example, you want to write a program
      that directly controls the assignment of the system priority for all LoadLeveler jobs.
      In this particular case, you would complete the following steps to control system
      priority assignment and the scheduling cycle:
      1. Decide the following:
          v Which system priority value to assign to jobs from specific sources or with
             specific resource requirements.
          v How often the central manager should run the scheduling algorithm. Your
             program has to be designed to issue the ll_run_scheduler subroutine at
             regular intervals; otherwise, LoadLeveler will not attempt to schedule any
             job steps.
          You also need to understand how changing the system priority affects the job
          queue. After you successfully use the ll_modify subroutine or the llmodify
          command to change system priority values, LoadLeveler will not readjust the
                           values for those job steps when the negotiator recalculates priorities at
                           regular intervals set through the
                              NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword. Also, you
                              can change the system priority for jobs only when those jobs are in the Idle
                              state or a state similar to it. To determine which job states are similar to the
                              Idle state or to the Running state, see the table in “LoadLeveler job states” on
                              page 19.
                        2.    Code a program to use LoadLeveler APIs to perform the following functions:
                              a. Use the Data Access APIs to obtain data about all jobs.
                              b. Determine whether jobs have been added or removed.
                              c. Use the ll_modify subroutine to set the system priority for the LoadLeveler
                                  jobs. The values you set through this subroutine will not be readjusted
                                  when the negotiator recalculates job step priorities.
                              d. Use the ll_run_scheduler subroutine to instruct the central manager to run
                                  the scheduling algorithm.
                              e. Set a timer for the scheduling interval, to repeat the scheduling instruction
                                  at regular intervals. This step is required to replace the effect of setting the
                                  configuration keyword NEGOTIATOR_CYCLE_DELAY, which LoadLeveler
                                  ignores when NEGOTIATOR_INTERVAL is set to zero.
                        3.    In the configuration file, set values for the following keywords:
                              v Set the NEGOTIATOR_INTERVAL keyword to zero to stop the central
                                 manager from automatically recalculating system priorities for jobs.
                              v (Optional) Set the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword to
                                 specify a threshold value. If the system priority assigned to a job step is less
                                 than this threshold value, the job will remain idle.
                        4.    Issue the llctl command with either the reconfig or recycle keyword.
                              Otherwise, LoadLeveler will not process the modifications you made to the
                              configuration file.
                        5.    (Optional) To make sure that the central manager’s automatic scheduling
                              activity has been disabled (by setting the NEGOTIATOR_INTERVAL keyword
                              to zero), use the llstatus command.
                        6.    Run your program under a user ID with administrator authority.

                        Once this procedure is complete, you might want to use one or more of the
                        following commands to make sure that jobs are scheduled according to the correct
                        system priority. The value of q_sysprio in command output indicates the system
                        priority for the job step.
                        v Use the command llq -s to detect whether a job step is idle because its system
                           priority is below the value set for the
                           SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword.
                        v Use the command llq -l to display the previous system priority for a job step.
                        v When unusual circumstances require you to change system priorities manually:
                           1. Use the command llmodify -s to set the system priority for LoadLeveler jobs.
                               The values you set through this command will not be readjusted when the
                               negotiator recalculates job step priorities.
                             2. Use the llrunscheduler command to instruct the central manager to run the
                                scheduling algorithm.

             Handling DCE security credentials
                        You can write a pair of programs to override the default LoadLeveler DCE
                        authentication method.

                        To enable the programs, use the DCE_AUTHENTICATION_PAIR keyword in
                       your configuration file:
                       v To substitute your own programs for the LoadLeveler-supplied default, specify:
                            DCE_AUTHENTICATION_PAIR = program1, program2
                         where program1 and program2 are the credential-handling programs described
                         in “Forwarding DCE credentials.”
                       v As an alternative, you can specify the program pair supplied with LoadLeveler:
                            DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce

      Specifying the DCE_AUTHENTICATION_PAIR keyword enables LoadLeveler
      support for forwarding DCE credentials to LoadLeveler jobs. You may override the
      default function provided by LoadLeveler to establish DCE credentials by
      substituting your own programs.

      Using the alternative program pair: llgetdce and llsetdce
      The program pair, llgetdce and llsetdce, forwards DCE credentials by copying
      credential cache files from the submitting machine to the executing machines.

       While this technique may require less overhead, it has been known to produce
       credentials on the executing machines that are not fully capable of being
       forwarded by rsh commands. This pair was the only one offered in earlier
       releases of LoadLeveler.

      Forwarding DCE credentials
       An example of a credentials object is a character string containing the DCE
       principal name and a password.

      program1 writes the following to standard output:
      v The length of the handle to follow
      v The handle

      If program1 encounters errors, it writes error messages to standard error.

      program2 receives the following as standard input:
      v The length of the handle to follow
      v The same handle written by program1

      program2 writes the following to standard output:
      v The length of the login context to follow
      v An exportable DCE login context, which is the idl_byte array produced from the
        sec_login_export_context DCE API call. For more information, see the DCE
        Security Services API chapter in the Distributed Computing Environment for AIX:
        Application Development Reference.
       v A character string suitable for assigning to the KRB5CCNAME environment
         variable. This string represents the location of the credentials cache
         established so that program2 can export the DCE login context.

      If program2 encounters errors, it writes error messages to standard error. The parent
      process, the LoadLeveler starter process, writes those messages to the starter log.

      For examples of programs that enable DCE security credentials, see the
      samples/lldce subdirectory in the release directory.

Handling an AFS token
      You can write a program, run by the scheduler, to refresh an AFS token when a job
      is started.

      To invoke the program, use the AFS_GETNEWTOKEN keyword in your
      configuration file.
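
       For example (the program path is illustrative):

          AFS_GETNEWTOKEN = /usr/local/loadl/exits/refresh_afs_token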


Before running the program, LoadLeveler sets up standard input and standard
                        output as pipes between the program and LoadLeveler. LoadLeveler also sets up
                        the following environment variables:
                        LOADL_STEP_OWNER
                                 The owner (UNIX user name) of the job
                        LOADL_STEP_COMMAND
                                 The name of the command the user’s job step invokes.
                        LOADL_STEP_CLASS
                                 The class this job step will run.
                        LOADL_STEP_ID
                                 The step identifier, generated by LoadLeveler.
                        LOADL_JOB_CPU_LIMIT
                                 The number of CPU seconds the job is limited to.
                        LOADL_WALL_LIMIT
                                 The number of wall clock seconds the job is limited to.

                        LoadLeveler writes the following current AFS credentials, in order, over the
                        standard input pipe:
                        v The ktc_principal structure indicating the service.
                        v The ktc_principal structure indicating the client.
                        v The ktc_token structure containing the credentials.

                        The ktc_principal structure is defined in the AFS header file afs_rxkad.h. The
                        ktc_token structure is defined in the AFS header file afs_auth.h.

                        LoadLeveler expects to read these same structures in the same order from the
                        standard output pipe, except these should be refreshed credentials produced by the
                        installation exit.

                        The installation exit can modify the passed credentials (to extend their lifetime)
                        and pass them back, or it can obtain new credentials. LoadLeveler takes whatever
                        is returned and uses it to authenticate the user prior to starting the user’s job.

             Filtering a job script
                        You can write a program to filter a job script when the job is submitted to the local
                        cluster and when the job is submitted from a remote cluster.

                         This program can, for example, modify defaults or perform site-specific verification
                        of parameters. To invoke the local job filter, specify the SUBMIT_FILTER keyword
                        in your configuration file. To invoke the remote job filter, specify the
                        CLUSTER_REMOTE_JOB_FILTER keyword in your configuration file. For more
                        information on these keywords, see the SUBMIT_FILTER or
                        CLUSTER_REMOTE_JOB_FILTER keyword in Chapter 12, “Configuration file
                        reference,” on page 263.

                        LoadLeveler sets the following environment variables when the program is
                        invoked:
                        LOADL_ACTIVE
                               LoadLeveler version
                        LOADL_STEP_COMMAND
                               Job command file name
                        LOADL_STEP_ID
                               The job identifier, generated by LoadLeveler
                        LOADL_STEP_OWNER
                               The owner (UNIX user name) of the job
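
                         As an illustration, the following Korn shell filter logs each submission and
                         passes the job command file through unchanged. It is a sketch only: it assumes
                         the filter receives the job command file on standard input and that whatever it
                         writes to standard output is submitted in its place, and the log path and filter
                         location are illustrative.

                            #!/bin/ksh
                            # Hypothetical submit filter: record who submitted what, then
                            # pass the job command file through unchanged.
                            LOG=/tmp/submit_filter.log
                            print "$(date) owner=$LOADL_STEP_OWNER file=$LOADL_STEP_COMMAND" >> $LOG
                            # Whatever is written to standard output becomes the job command file.
                            cat -

                         The corresponding configuration file entry might be:
                            SUBMIT_FILTER = /usr/local/loadl/exits/submit_filter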


For details about specific keyword syntax and use in the configuration file, see
      Chapter 12, “Configuration file reference,” on page 263.

Writing prolog and epilog programs
      An administrator can write prolog and epilog installation exits that can run before
      and after a LoadLeveler job runs, respectively.

      Prolog and epilog programs fall into two types:
      v Those that run as the LoadLeveler user ID.
      v Those that run in a user’s environment.

      Depending on the type of processing you want to perform before or after a job
      runs, specify one or more of the following configuration file keywords, in any
      combination:
      v To run a prolog or epilog program under the LoadLeveler user ID, specify
        JOB_PROLOG or JOB_EPILOG, respectively.
      v To run a prolog or epilog program under the user’s environment, specify
        JOB_USER_PROLOG or JOB_USER_EPILOG, respectively.
      You do not have to provide a prolog/epilog pair of programs. You may, for
      example, use only a prolog program that runs under the LoadLeveler user ID.
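
       For example, to run a prolog and an epilog under the LoadLeveler user ID and a
       prolog in the user's environment, you might set (paths are illustrative):

          JOB_PROLOG      = /usr/local/loadl/exits/prolog
          JOB_EPILOG      = /usr/local/loadl/exits/epilog
          JOB_USER_PROLOG = /usr/local/loadl/exits/user_prolog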

      For details about specific keyword syntax and use in the configuration file, see
      Chapter 12, “Configuration file reference,” on page 263.

      Note: If process tracking is enabled and your prolog or epilog program invokes
            the mailx command, set the sendwait variable to prevent the background
            mail process from being killed by process tracking.

      A user environment prolog or epilog runs with AFS authentication if installed and
      enabled. For security reasons, you must code these programs on the machines
      where the job runs and on the machine that schedules the job. If you do not define
      a value for these keywords, the user environment prolog and epilog settings on the
      executing machine are ignored.

      The user environment prolog and epilog can set environment variables for the job
      by sending information to standard output in the following format:
      env id = value

      Where:
      id     Is the name of the environment variable
      value Is the value (setting) of the environment variable

      For example, the user environment prolog sets the environment variable
      STAGE_HOST for the job:
      #!/bin/sh
      echo env STAGE_HOST=shd22

      Coding conventions for prolog programs
      The prolog program is invoked by the starter process.

      Once the starter process invokes the prolog program, the program obtains
      information about the job from environment variables.

      Syntax:

prolog_program

                        Where prolog_program is the name of the prolog program as defined in the
                        JOB_PROLOG keyword.

                        No arguments are passed to the program, but several environment variables are
                        set. For more information on these environment variables, see “Run-time
                        environment variables” on page 400.

                        The real and effective user ID of the prolog process is the LoadLeveler user ID. If
                        the prolog program requires root authority, the administrator must write a secure
                        C or Perl program to perform the desired actions. You should not use shell scripts
                        with set uid permissions, since these scripts may make your system susceptible to
                        security problems.

                        Return code values:
                        0      The job will begin.

                        If the prolog program is ended with a signal, the job does not begin and a message
                        is written to the starter log.

                        Sample prolog programs:
                        v Sample of a prolog program for korn shell:
                           #!/bin/ksh
                           #
                           # Set up environment
                           set -a
                           . /etc/environment
                           . /.profile
                           export PATH="$PATH:/loctools/lladmin/bin"
                           export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.prolog"
                           #
                           # Do set up based upon job step class
                           #
                           case $LOADL_STEP_CLASS in
                               # An OSL job is about to run; make sure the osl file system is
                               # mounted. If the mount fails, the job step should not run.
                               "OSL")
                                 mount_osl_files >> $LOG
                                 if [ $? -ne 0 ]
                                    then EXIT_CODE=1
                                    else EXIT_CODE=0
                                 fi
                                 ;;
                               # A simulation job is about to run; simulation data has to
                               # be made available to the job. The status from the copy script
                               # must be zero or the job step cannot run.
                               "sim")
                                 copy_sim_data >> $LOG
                                 if [ $? -eq 0 ]
                                    then EXIT_CODE=0
                                    else EXIT_CODE=1
                                 fi
                                 ;;
                               # All other jobs require free space in /tmp; make sure
                               # enough space is available.
                               *)
                                 check_tmp >> $LOG
                                 EXIT_CODE=$?
                                 ;;
                           esac
                           # The job step will run only if EXIT_CODE == 0
                           exit $EXIT_CODE
v Sample of a prolog program for C shell:
  #!/bin/csh
  #
  # Set up environment
  source /u/loadl/.login
  #
  setenv PATH "${PATH}:/loctools/lladmin/bin"
  setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.prolog"
  #
  # Do set up based upon job step class
  #
                           switch ($LOADL_STEP_CLASS)
                               # An OSL job is about to run; make sure the osl file system is
                               # mounted. If the mount fails, the job step should not run.
                               case "OSL":
                                 mount_osl_files >> $LOG
                                 if ($status != 0) then
                                   set EXIT_CODE = 1
                                 else
                                   set EXIT_CODE = 0
                                 endif
                                 breaksw
                               # A simulation job is about to run; simulation data has to
                               # be made available to the job. The status from the copy script
                               # must be zero or the job step cannot run.
                               case "sim":
                                 copy_sim_data >> $LOG
                                 if ($status == 0) then
                                   set EXIT_CODE = 0
                                 else
                                   set EXIT_CODE = 1
                                 endif
                                 breaksw
                               # All other jobs require free space in /tmp; make sure
                               # enough space is available.
                               default:
                                 check_tmp >> $LOG
                                 set EXIT_CODE = $status
                                 breaksw
                           endsw

  # The job step will run only if EXIT_CODE == 0
  exit $EXIT_CODE

Coding conventions for epilog programs
The installation defined epilog program is invoked after a job step has completed.

The purpose of the epilog program is to perform any required clean up such as
unmounting file systems, removing files, and copying results. The exit status of
both the prolog program and the job step is set in environment variables.

Syntax:
epilog_program

Where epilog_program is the name of the epilog program as defined in the
JOB_EPILOG keyword.


No arguments are passed to the program but several environment variables are set.
                        These environment variables are described in “Run-time environment variables” on
                        page 400. In addition, the following environment variables are set for the epilog
                        programs:
                        LOADL_PROLOG_EXIT_CODE
                             The exit code from the prolog program. This environment variable is set
                             only if a prolog program is configured to run.
                        LOADL_USER_PROLOG_EXIT_CODE
                             The exit code from the user prolog program. This environment variable is
                             set only if a user prolog program is configured to run.
                        LOADL_JOB_STEP_EXIT_CODE
                             The exit code from the job step.

                        Note: To interpret the exit status of the prolog program and the job step, convert
                              the string to an integer and use the macros found in the sys/wait.h file.
                              These macros include:
                              v WEXITSTATUS: gives you the exit code
                              v WTERMSIG: gives you the signal that terminated the program
                              v WIFEXITED: tells you if the program exited
                              v WIFSIGNALED: tells you if the program was terminated by a signal

                                 The exit codes returned by the WEXITSTATUS macro are the valid codes.
                                 However, if you look at the raw status values, the exit code may appear
                                 to be 256 times the expected return code; the raw values are in the
                                 format returned by the wait3 system call.

                                Sample epilog programs:
                                v Sample of an epilog program for korn shell:
                                  #!/bin/ksh
                                  #
                                  # Set up environment
                                  set -a
                                  . /etc/environment
                                  . /.profile
                                  export PATH="$PATH:/loctools/lladmin/bin"
                                  export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.epilog"
                                  #
                                   if [[ -z $LOADL_PROLOG_EXIT_CODE ]]
                                   then
                                      echo "Prolog did not run" >> $LOG
                                   else
                                      echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   if [[ -z $LOADL_USER_PROLOG_EXIT_CODE ]]
                                   then
                                      echo "User environment prolog did not run" >> $LOG
                                   else
                                      echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   if [[ -z $LOADL_JOB_STEP_EXIT_CODE ]]
                                   then
                                      echo "Job step did not run" >> $LOG
                                   else
                                      echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   # Do clean up based upon job step class
                                   #
                                   case $LOADL_STEP_CLASS in
                                     # An OSL job just ran; unmount the file system.
                                     "OSL")
                                       umount_osl_files >> $LOG
                                       ;;
                                     # A simulation job just ran; remove input files.
                                     # Copy results if the simulation was successful (the job
                                     # step exit status is in LOADL_JOB_STEP_EXIT_CODE).
                                     "sim")
                                       rm_sim_data >> $LOG
                                       if [ "$LOADL_JOB_STEP_EXIT_CODE" = 0 ]
                                         then copy_sim_results >> $LOG
                                       fi
                                       ;;
                                     # Clean up /tmp
                                     *)
                                       clean_tmp >> $LOG
                                       ;;
                                   esac
            v Sample of an epilog program for C shell:
              #!/bin/csh
              #
              # Set up environment
              source /u/loadl/.login
              #
              setenv PATH "${PATH}:/loctools/lladmin/bin"
               setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.epilog"
              #
              if ( ${?LOADL_PROLOG_EXIT_CODE} ) then
              echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
              else
              echo "Prolog did not run" >> $LOG
              endif
              #
              if ( ${?LOADL_USER_PROLOG_EXIT_CODE} ) then
                  echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
                else
                  echo "User environment prolog did not run" >> $LOG
              endif
              #
              if ( ${?LOADL_JOB_STEP_EXIT_CODE} ) then
                  echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
                else
                  echo "Job step did not run" >> $LOG
              endif
              #
              # Do clean up based upon job step class
              #
              switch ($LOADL_STEP_CLASS)
                # A OSL job just ran, unmount the filesystem.
                case "OSL":
                  umount_osl_files >> $LOG
                  breaksw
              # A simulation job just ran; remove input files.
              # Copy results if the simulation was successful (the job step
              # exit status is in LOADL_JOB_STEP_EXIT_CODE).
              case "sim":
                rm_sim_data >> $LOG
                if ($LOADL_JOB_STEP_EXIT_CODE == 0) then
                  copy_sim_results >> $LOG
                endif
                breaksw
              # Clean up /tmp
              default:
                clean_tmp >> $LOG
                breaksw
              endsw


Using your own mail program
      You can write a program to override the LoadLeveler default mail notification
      method.

      You can use this program, for example, to display your own messages to users
      when a job completes, or to automate tasks such as sending error messages to a
      network manager.




The syntax for the program is the same as it is for standard UNIX mail programs;
                        the command is called with the following arguments:
                        v -s to indicate a subject.
                        v A pointer to a string containing the subject.
                        v A pointer to a string containing a list of mail recipients.
                        The mail message is taken from standard input.

                        To enable this program to replace the default mail notification method, use the
                        MAIL keyword in the configuration file. For details about specific keyword syntax
                        and use in the configuration file, see Chapter 12, “Configuration file reference,” on
                        page 263.
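
                         The following Korn shell sketch illustrates such a replacement. It assumes
                         only the calling convention described above (-s, a subject string, then the
                         recipients) and the standard mail command; the path, banner text, and file
                         names are illustrative.

                            #!/bin/ksh
                            # Hypothetical mail exit: invoked as  program -s "subject" recipients,
                            # with the message body on standard input.
                            subject="LoadLeveler notification"
                            if [ "$1" = "-s" ]
                            then
                                subject=$2
                                shift 2
                            fi
                            # Prepend a site banner, then hand the message to the system mailer.
                            # The recipients may arrive as one blank-separated string, so let
                            # the shell split the remaining arguments.
                            {
                                print "Message from the LoadLeveler cluster:"
                                print
                                cat -
                            } | mail -s "$subject" $*

                         To enable it, the configuration file might contain:
                            MAIL = /usr/local/loadl/exits/llmail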




Chapter 5. Defining LoadLeveler resources to administer
               After installing LoadLeveler, you may customize it by modifying the
               administration file.

               The administration file optionally lists and defines the machines in the
               LoadLeveler cluster and the characteristics of classes, users, and groups.

               LoadLeveler does not prevent you from having multiple copies of the
               administration file, but you must update all of the copies whenever you change
               one. Keeping only one administration file prevents any confusion.

               Table 18 lists the LoadLeveler resources you may define by modifying the
               administration file.
               Table 18. Roadmap of tasks for modifying the LoadLeveler administration file
               To learn about:                 Read the following:
               Modifying the administration    “Steps for modifying an administration file”
               file
               Defining LoadLeveler            v “Defining machines” on page 84
               resources to administer
                                               v “Defining adapters” on page 86
                                               v “Defining classes” on page 89
                                               v “Defining users” on page 97
                                               v “Defining groups” on page 99
                                               v “Defining clusters” on page 100
               Correctly specifying            Chapter 13, “Administration file reference,” on page 321
               administration file keywords



Steps for modifying an administration file
               All LoadLeveler commands, daemons, and processes read the administration and
               configuration files at startup.

               If you change the administration or configuration files after LoadLeveler has
               already started, any LoadLeveler command or process, such as the LoadL_starter
               process, will read the newer version of the files while the running daemons will
               continue to use the data from the older version. To ensure that all LoadLeveler
               commands, daemons, and processes use the same configuration data, run the
               reconfiguration command on all machines in the cluster each time the
               administration or configuration files are changed.

               Before you begin: You need to:
               v Ensure that the installation procedure has completed successfully and that the
                 administration file, LoadL_admin, exists in LoadLeveler’s home directory. For
                 additional details about installation, see TWS LoadLeveler: Installation Guide.
               v Know how to correctly specify keywords in the administration file. For
                 information about administration file keyword syntax and other details, see
                 Chapter 13, “Administration file reference,” on page 321.



v (Optional) Know how to correctly issue the llextRPD command, if you choose to
                          use it (see “llextRPD - Extract data from an RSCT peer domain” on page 443).

                        Perform the following steps to modify the administration file, LoadL_admin:
                        1. Identify yourself as a LoadLeveler administrator using the LOADL_ADMIN
                           keyword.
                        2. Provide the following stanza types in the administration file:
                           v One machine stanza to define the central manager for the LoadLeveler
                              cluster. You also may create machine stanzas for other machines in the
                              LoadLeveler cluster.
                              You can use the llextRPD command to automatically create machine stanzas.
                           v (Optional) An adapter stanza for each type of network adapter that you want
                              LoadLeveler jobs to be able to request.
                              You can use the llextRPD command to automatically create adapter stanzas.
                        3. (Optional) Specify one or more of the following stanza types:
                           v A class stanza for each set of LoadLeveler jobs that have similar
                              characteristics or resource requirements.
                           v A user stanza for specific users, if their requirements do not match those
                              characteristics defined in the default user stanza.
                           v A group stanza for each set of LoadLeveler users that have similar
                              characteristics or resource requirements.
                        4. (Optional) You may specify values for additional administration file keywords,
                           which are listed and described in “Administration file keyword descriptions”
                           on page 327.
                        5. Notify LoadLeveler daemons by issuing the llctl command with either the
                           reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
                           modifications you made to the administration file.
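
                         Putting these steps together, a minimal LoadL_admin file might look like the
                         following sketch (machine and class names are illustrative, and the class
                         stanza is optional):

                            # Central manager for the cluster:
                            machine_a: type = machine
                                       central_manager = true

                            # An ordinary executing machine:
                            machine_b: type = machine

                            # An optional class for short jobs:
                            short:     type = class
                                       wall_clock_limit = 00:30:00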

Defining machines
                        The information in a machine stanza defines the characteristics of that machine.

                        You do not have to specify a machine stanza for every machine in the LoadLeveler
                        cluster, but you must have one machine stanza for the machine that will serve as
                        the central manager.

                         If you do not specify a machine stanza for a machine in the cluster, the
                         machine and the central manager still communicate and jobs are scheduled on
                         the machine, but the machine is assigned the default values specified in the
                         default machine stanza. If there is no default stanza, the machine is assigned
                         the default values set by LoadLeveler.

                        Any machine name used in the stanza must be a name which can be resolved to
                        an IP address. This name is referred to as an interface name because the name can
                        be used for a program to interface with the machine. Generally, interface names
                        match the machine name, but they do not have to.

                         By default, LoadLeveler appends the DNS domain name to the end of any machine
                         name that has no domain name appended before resolving its address. If you
                         specify a machine name without a domain name and you do not want LoadLeveler
                         to append the DNS domain name to it, specify the name with a trailing period.
                         You may need to specify machine names in this way if you run a cluster with
                         more than one nameserving technique. For example, if you are using a DNS
                         nameserver and running NIS, you may have some machine names that are resolved
                         by NIS to which you do not want LoadLeveler to append DNS names. In situations
                         such as this, you should also specify the name_server keyword in your machine
                         stanzas.

      Under the following conditions, you must have a machine stanza for the machine
      in question:
      v If you set the MACHINE_AUTHENTICATE keyword to true in the
         configuration file, then you must create a machine stanza for each node that
         LoadLeveler includes in the cluster.
      v If the machine’s hostname (the name of the machine returned by the UNIX
         hostname command) does not match an interface name. In this case, you must
         specify the interface name as the machine stanza name and specify the
         machine’s hostname using the alias keyword.
      v If the machine’s hostname does match an interface name but not the correct
         interface name.

      For information about automatically creating machine stanzas, see “llextRPD -
      Extract data from an RSCT peer domain” on page 443.

Planning considerations for defining machines
      There are several matters to consider before customizing the administration file.

      Before customizing the administration file, consider the following:
      v Node availability
        Some workstation owners might agree to accept LoadLeveler jobs only when
        they are not using the workstation themselves. Using LoadLeveler keywords,
        these workstations can be configured to be available at designated times only.
      v Common name space
        To run jobs on any machine in the LoadLeveler cluster, a user needs the same
        uid (the user ID number for a user) and gid (the group ID number for a group)
        on every machine in the cluster.
        For example, if there are two machines in your LoadLeveler cluster, machine_1
        and machine_2, user john must have the same user ID and login group ID in the
        /etc/passwd file on both machines. If user john has user ID 1234 and login group
        ID 100 on machine_1, then user john must have the same user ID and login
        group ID in /etc/passwd on machine_2. (LoadLeveler requires a job to run with
        the same group ID and user ID of the person who submitted the job.)
        If you do not have a user ID on one machine, your jobs will not run on that
        machine. Also, many commands, such as llq, will not work correctly if a user
        does not have a user ID on the central manager machine.
        However, there are cases where you may choose to not give a user a login ID on
        a particular machine. For example, a user does not need an ID on every
        submit-only machine; the user only needs to be able to submit jobs from at least
        one such machine. Also, you may choose to restrict a user’s access to a Schedd
        machine that is not a public scheduler; again, the user only needs access to at
        least one Schedd machine.
      v Resource handling
        Some nodes in the LoadLeveler cluster might have special software installed that
        users might need to run their jobs successfully. You should configure
        LoadLeveler to distinguish those nodes from other nodes using, for example,
        machine features.


Machine stanza format and keyword summary
                        Machine stanzas take the following format.

                        Default values for keywords appear in bold:


                        label: type = machine
                        adapter_stanzas = stanza_list
                        alias = machine_name
                        central_manager = true | false | alt
                        cpu_speed_scale = true | false
                        machine_mode = batch | interactive | general
                        master_node_exclusive = true | false
                        max_jobs_scheduled = number
                        name_server = list
                        pool_list = pool_numbers
                        reservation_permitted = true | false
                        resources = name(count) name(count) ... name(count)
                        schedd_fenced = true | false
                        schedd_host = true | false
                        speed = number
                        submit_only = true | false

                        Figure 11. Format of a machine stanza

             Examples: Machine stanzas
                        These machine stanza examples may apply to your situation.
                        v Example 1
                          In this example, the machine is being defined as the central manager.
                           #
                           machine_a: type = machine
                           central_manager = true    # central manager runs here
                         v Example 2
                           This example sets up a submit-only node. Note that the submit_only keyword in
                           the example is set to true, while the schedd_host keyword is set to false. You
                           must also ensure that you set schedd_host to true on at least one other node
                           in the cluster.
                           #
                           machine_b: type = machine
                           central_manager = false     #   not the central manager
                           schedd_host = false         #   not a scheduling machine
                           submit_only = true          #   submit only machine
                           alias = machineb            #   interface name
                        v Example 3
                          In the following example, machine_c is the central manager and has an alias
                          associated with it:
                           #
                           machine_c: type = machine
                           central_manager = true    # central manager runs here
                           schedd_host = true        # defines a public scheduler
                           alias = brianne


Defining adapters
                        An adapter stanza identifies network adapters that are available on the machines
                        in the LoadLeveler cluster.



If you want LoadLeveler jobs to be able to request specific adapters, you must
          either specify adapter stanzas or configure dynamic adapters in the administration
          file.

          Note the following when using an adapter stanza:
          v An adapter stanza is required for each adapter stanza name you specify on the
            adapter_stanzas keyword of the machine stanza.
          v The adapter_name, interface_address and interface_name keywords are
            required.

          For information about creating adapter stanzas, see “llextRPD - Extract data from
          an RSCT peer domain” on page 443 for peer domains.
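
           As an illustration, a static adapter stanza might look like the following
           sketch (the label, address, and names are illustrative; adapter_name,
           interface_address, and interface_name are the required keywords, and the
           stanza label is what you list on the machine stanza's adapter_stanzas
           keyword):

              en0_node01: type = adapter
                          adapter_name      = en0
                          interface_address = 192.168.1.17
                          interface_name    = node01.example.com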

    Configuring dynamic adapters
          LoadLeveler can dynamically determine the adapters in any operating system
          instance (OSI) that has RSCT installed.

          LoadLeveler must be told on an OSI basis if it is to handle dynamic adapter
          configuration changes for that OSI. The specification of whether to use dynamic or
          static adapter configuration for an OSI is done through the presence or absence of
          the machine stanza’s adapter_stanzas keyword.

If a machine stanza in the administration file contains an adapter_stanzas
statement, LoadLeveler takes this as a directive from the administrator to use
only the specified adapters. For this OSI, LoadLeveler performs no dynamic
adapter configuration or processing. If an adapter change occurs in this OSI,
the administrator must make the corresponding change in the administration
file and then stop and restart, or reconfigure, the LoadLeveler startd daemon
to pick up the adapter changes. If an OSI (machine stanza) in the
administration file does not contain the adapter_stanzas keyword, LoadLeveler
takes this as a directive to configure the adapters for that OSI dynamically.
For that OSI, LoadLeveler determines which adapters are present at startup
through calls to the RMC API. If an adapter change occurs during execution in
the OSI, LoadLeveler automatically detects and handles the change without
requiring a restart or reconfiguration.
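
For illustration, here is a minimal sketch of the two cases; the machine
names and adapter stanza labels are hypothetical:

   # Static configuration: LoadLeveler uses only the listed adapter
   # stanzas and does no dynamic adapter processing for this OSI
   node01: type = machine
           adapter_stanzas = node01-sn0 node01-en0

   # Dynamic configuration: no adapter_stanzas keyword, so LoadLeveler
   # discovers this OSI's adapters at startup through the RMC API
   node02: type = machine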

    Configuring InfiniBand adapters
InfiniBand adapters, known as host channel adapters (HCAs), can be multiported.

Tasks can use the ports of an HCA independently, which allows the scheduling
algorithm to allocate each port separately.

|         Note: InfiniBand adapters are supported on the AIX operating system and in SUSE
|               Linux Enterprise Server (SLES) 9 and SLES 10 on TWS LoadLeveler for
|               POWER clusters.

          An InfiniBand adapter can have multiple adapter ports. Each port on the
          InfiniBand adapter can be connected to one network and will be managed by TWS
          LoadLeveler as a switch adapter. InfiniBand adapter ports derive their resources
          and usage state from the InfiniBand adapter with which they are associated, but
          are allocated to jobs separately.

If you want LoadLeveler jobs to be able to request InfiniBand adapters, you
must either specify adapter stanzas or configure dynamic adapters in the
administration file. The InfiniBand ports are identified to TWS LoadLeveler in
the same way other adapters are. Stanzas are specified in the administration
file if static adapters are used, and the ports are discovered by RSCT if
dynamic adapters are used.

                        The port_number administration keyword has been added to support an
                        InfiniBand port. The port_number keyword specifies the port number of the
                        InfiniBand adapter port. Only InfiniBand ports are managed and displayed by
                        TWS LoadLeveler; the InfiniBand adapter itself is not. The adapter stanza for
                        InfiniBand support only contains the adapter port information. There is no
                        InfiniBand adapter information in the adapter stanza (see example 2 in “Examples:
                        Adapter stanzas” on page 89).

                        Note:
                                1. TWS LoadLeveler distributes the switch adapter windows of the
                                   InfiniBand adapter equally among its ports and the allocation is not
                                   adjusted should all of the resources on one port be consumed.
                                2. The InfiniBand ports determine their usage state and availability from
                                   their InfiniBand adapter. If one port is in use exclusively, no other ports
                                   on the adapter can be used for any other job.
3. If you have a mixed cluster where some nodes use the InfiniBand
   adapter and some nodes use the HPS adapter, you must organize the
   nodes into pools so that a job is dispatched only to nodes with the
   same kind of switch adapter (see the sketch after this list).
                                4. There is no change to the way the InfiniBand adapters are requested on
                                   the job command file network statement; that is, InfiniBand adapters are
                                   requested the same way as any other adapter would be.
5. Because InfiniBand adapters do not support rCxt blocks, jobs that would
   otherwise use InfiniBand adapters, but which also request rCxt blocks
   with the rcxtblks keyword on the network statement, will remain in the
   idle state. This behavior is consistent with how other adapters (for
   example, the HPS) behave in the same situation. You can use the llstatus
   -a command to see rCxt blocks on adapters (see “llstatus - Query
   machine status” on page 512 for more information).
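
The following is a minimal sketch of the pool setup mentioned in note 3; the
machine names and pool numbers are hypothetical:

   # Keep InfiniBand nodes and HPS nodes in separate pools so that a
   # job is dispatched to only one kind of switch adapter
   ib_node01:  type = machine
               pool_list = 1        # InfiniBand nodes belong to pool 1

   hps_node01: type = machine
               pool_list = 2        # HPS nodes belong to pool 2

A job can then be restricted to one pool, for example with a requirements
expression on the Pool machine variable in its job command file.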

             Adapter stanza format and keyword summary
                        Consider this format of an adapter stanza.

                        An adapter stanza has the following format:


                        label: type = adapter
                        adapter_name = name
                        adapter_type = type
                        device_driver_name = name
                        interface_address = IP_address
                        interface_name = name
                        logical_id = id
                        multilink_address = ip_address
                        multilink_list = adapter_name <, adapter_name>*
                        network_id = id
                        network_type = type
                        port_number = number
                        switch_node_number = integer

                        Figure 12. Format of an adapter stanza




Examples: Adapter stanzas
              These adapter stanza examples may apply to your situation.
              v Example 1: Specifying an HPS adapter
                   In the following example, the adapter stanza called
                   “c121s0n10.ppd.pok.ibm.com” specifies an HPS adapter. Note that
                   c121s0n10.ppd.pok.ibm.com is also specified on the adapter_stanzas keyword of
                   the machine stanza for the “yugo” machine.
                            yugo:   type=machine
                                    adapter_stanzas = c121s0n10.ppd.pok.ibm.com
                                    ...

                   c121s0n10.ppd.pok.ibm.com: type = adapter
                                  adapter_name = sn0
                                  network_type = switch
                                  interface_address = 192.168.0.10
                                  interface_name = c121s0n10.ppd.pok.ibm.com
                                  multilink_address = 10.10.10.10
                                  logical_id = 2
                                  adapter_type = Switch_Network_Interface_For_HPS
                                  device_driver_name = sni0
                                  network_id = 1

                   c121f2rp02.ppd.pok.ibm.com: type = adapter
                                  adapter_name = en0
                                  network_type = ethernet
                                  interface_address = 9.114.66.74
                                  interface_name = c121f2rp02.ppd.pok.ibm.com
                                  device_driver_name = ent0
              v Example 2: Specifying an InfiniBand adapter
                   In the following example, the port_number specifies the port number of the
                   InfiniBand adapter port:
                   192.168.9.58: type = adapter
                           adapter_name = ib1
                           network_type = InfiniBand
                           interface_address = 192.168.9.58
                           interface_name = 192.168.9.58
                           logical_id = 23
                           adapter_type = InfiniBand
                           device_driver_name = ehca0
                           network_id = 18338657682652659714
                           port_number = 2


Defining classes
              The information in a class stanza defines characteristics for that class.

              These characteristics can include the quantities of consumable resources that may
              be used by a class per machine or cluster.

              Within a class stanza, you can have optional user substanzas that define policies
              that apply to a user’s job steps that need to use this class. For more information
              about user substanzas, see “Defining user substanzas in class stanzas” on page 94.
              For information about user stanzas, see “Defining users” on page 97.

        Using limit keywords
              A limit is the amount of a resource that a job step or a process is allowed to use.
              (A process is a dispatchable unit of work.) A job step may be made up of several
              processes.

Limits include both a hard limit and a soft limit. When a hard limit is exceeded,
                        the job is usually terminated. When a soft limit is exceeded, the job is usually
                        given a chance to perform some recovery actions. Limits are enforced either per
process or per job step, depending on the type of limit. For parallel job steps,
                        which consist of multiple tasks running on multiple machines, limits are enforced
                        on a per task basis.

                        The class stanza includes the limit keywords shown in Table 19, which allow you
                        to control the amount of resources used by a job step or a job process.
                        Table 19. Types of limit keywords
                        Limit                                       How the limit is enforced
                        as_limit                                    Per process
                        ckpt_time_limit                             Per job step
                        core_limit                                  Per process
                        cpu_limit                                   Per process
                        data_limit                                  Per process
                        default_wall_clock_limit                    Per job step
                        file_limit                                  Per process
                        job_cpu_limit                               Per job step
                        locks_limit                                 Per process
                        memlock_limit                               Per process
                        nofile_limit                                Per process
                        nproc_limit                                 Per user
                        rss_limit                                   Per process
                        stack_limit                                 Per process
                        wall_clock_limit                            Per job step


                        For example, a common limit is the cpu_limit, which limits the amount of CPU
                        time a single process can use. If you set cpu_limit to five hours and you have a job
                        step that forks five processes, each process can use up to five hours of CPU time,
                        for a total of 25 CPU hours. Another limit that controls the amount of CPU used is
                        job_cpu_limit. For a serial job step, if you impose a job_cpu_limit of five hours,
                        the entire job step (made up of all five processes) cannot consume more than five
                        CPU hours. For information on using this keyword with parallel jobs, see
                        job_cpu_limit keyword.

                        You can specify limits in either the class stanza of the administration file or in the
                        job command file. The lower of these two limits will be used to run the job even if
                        the system limit for the user is lower. For more information, see:
                        v “Enforcing limits”
                        v “Administration file keyword descriptions” on page 327 or “Job command file
                           keyword descriptions” on page 359
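
As a brief sketch of the two places a limit can be set (the class name and
values are illustrative); here, the lower job_cpu_limit from the job command
file is the one used:

   # Administration file: class stanza sets hardlimit,softlimit
   medium: type = class
           job_cpu_limit = 1800,1200     # 30 min hard, 20 min soft

   # Job command file: requests a lower limit, so this one is used
   # @ class         = medium
   # @ job_cpu_limit = 900,600
   # @ executable    = /bin/hostname
   # @ queue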

                        Enforcing limits
                        LoadLeveler depends on the underlying operating system to enforce process limits.

Users should verify that a process limit such as rss_limit is enforced by the
operating system; otherwise, setting it in LoadLeveler will have no effect.


Exceeding job step limits:

When a hard limit is exceeded LoadLeveler sends a non-trappable signal (except in
the case of a parallel job) to the process group that LoadLeveler created for the job
step.

When a soft limit is exceeded, LoadLeveler sends a trappable signal to the process
group. Any job application that intends to trap a signal sent by LoadLeveler must
ensure that all processes in the process group set up the appropriate signal
handler.

All processes in the job step normally receive the signal. The exception to this rule
is when a child process creates its own process group. That action isolates the
child’s process, and its children, from any signals that LoadLeveler sends. Any
child process creating its own process group is still known to process tracking. So,
if process tracking is enabled, all the child processes are terminated when the main
process terminates.

Table 20 summarizes the actions that the LoadL_starter daemon takes when a job
step limit is exceeded.
Table 20. Enforcing job step limits
Type of Job         When a Soft Limit is Exceeded          When a Hard Limit is Exceeded
Serial              SIGXCPU or SIGKILL issued              SIGKILL issued
Parallel            SIGXCPU issued to both the user        SIGTERM issued
                    program and to the parallel
                    daemon


On systems that do not support SIGXCPU, LoadLeveler does not distinguish
between hard and soft limits. When a soft limit is reached on these platforms,
LoadLeveler issues a SIGKILL.

Enforcing per process limits:

For per process limits, what happens when your job reaches and exceeds either the
soft limit or the hard limit depends on the operating system you are using.

When a job forks a process that exceeds a per process limit, such as the CPU limit,
the operating system (not LoadLeveler) terminates the process by issuing a
SIGXCPU. As a result, you will not see an entry in the LoadLeveler logs indicating
that the process exceeded the limit. The job will complete with a 0 return code.
LoadLeveler can only report the status of any processes it has started.

If you need more specific information, refer to your operating system
documentation.

How LoadLeveler uses hard limits:

Consider these details on how LoadLeveler uses hard limits.




See Table 21 for more information on specifying limits.
                        Table 21. Setting limits
                        If the hard limit is:                     Then LoadLeveler does the following:
                        Set in both the class stanza and the      Smaller of the two limits is taken into consideration. If
                        job command file                          the smaller limit is the job limit, the job limit is then
                                                                  compared with the user limit set on the machine that
                                                                  runs the job. The smaller of these two values is used.
                                                                  If the limit used is the class limit, the class limit is
                                                                  used without being compared to the machine limit.
                        Not set in either the class stanza or     User per process limit set on the machine that runs
                        the job command file                      the job is used.
                        Set in the job command file and is        The job is not submitted.
                        less than its respective job soft limit
                        Set in the class stanza and is less       Soft limit is adjusted downward to equal the hard
                        than its respective class stanza soft     limit.
                        limit
                        Specified in the job command file         Hard limit must be greater than or equal to the
                                                                  specified soft limit and less than or equal to the limit
                                                                  set by the administrator in the class stanza of the
                                                                  administration file.

                                                                  Note: If the per process limit is not defined in the
                                                                  administration file and the hard limit defined by the
                                                                  user in the job command file is greater than the limit
                                                                  on the executing machine, then the hard limit is set to
                                                                  the machine limit.
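
As an illustration of the first row of Table 21 (the class name and values
are hypothetical):

   # Administration file: class hard limit
   small: type = class
          cpu_limit = 10:00      # 10 minute per-process hard limit

   # Job command file: a smaller limit, so the job limit is selected and
   # then compared with the user limit on the machine that runs the job
   # @ class     = small
   # @ cpu_limit = 5:00
   # @ queue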



             Allowing users to use a class
                        In a class stanza, you may define a list of users or a list of groups to identify those
                        who may use the class.

                        To do so, use the include_users or include_groups keyword, respectively, or you
                        may use both keywords. If you specify both keywords, a particular user must
                        satisfy both the include_users and the include_groups restrictions for the class.
                        This requirement means that a particular user must be defined not only in a User
                        stanza in the administration file, but also in one of the following ways:
                        v The user’s name must appear in the include_users keyword in a Group stanza
                           whose name corresponds to a name in the include_groups keyword of the Class
                           stanza.
                        v The user’s name must appear in the include_groups keyword of the Class
                           stanza. For information about specifying a user name in a group list, see the
                           include_groups keyword description in “Administration file keyword
                           descriptions” on page 327.
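
For instance, a minimal sketch (the user, group, and class names are
hypothetical) in which user carol satisfies both restrictions for class chem:

   # Group stanza: carol is a member of group lab
   lab:  type = group
         include_users = carol dave

   # Class stanza: carol appears in include_users, and her group lab
   # appears in include_groups, so carol may use class chem
   chem: type = class
         include_users = carol
         include_groups = lab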

             Class stanza format and keyword summary
                        Class stanzas are optional.

                        Class stanzas take the following format. Default values for keywords appear in
                        bold.




label: type = class
          admin= list
          allow_scale_across_jobs = true | false
          as_limit= hardlimit,softlimit
          ckpt_dir = directory
          ckpt_time_limit = hardlimit,softlimit
          class_comment = "string"
          core_limit = hardlimit,softlimit
          cpu_limit = hardlimit,softlimit
          data_limit = hardlimit,softlimit
          default_resources = name(count) name(count)...name(count)
          default_node_resources = name(count) name(count)...name(count)
          env_copy = all | master
          exclude_bg = list
          exclude_groups = list
          exclude_users = list
          file_limit = hardlimit,softlimit
          include_bg = list
          include_groups = list
          include_users = list
          job_cpu_limit = hardlimit,softlimit
          locks_limit = hardlimit,softlimit
          master_node_requirement = true | false
          max_node = number
          max_protocol_instances = number
          max_top_dogs = number
          max_total_tasks = number
          maxjobs = number
          memlock_limit = hardlimit,softlimit
          nice = value
          nofile_limit = hardlimit,softlimit
          nproc_limit = hardlimit,softlimit
          priority = number
          rss_limit = hardlimit,softlimit
          smt = yes | no | as_is
          stack_limit = hardlimit,softlimit
|         striping_with_minimum_networks = true | false
          total_tasks = number
          wall_clock_limit = hardlimit,softlimit
          default_wall_clock_limit = hardlimit,softlimit

          Figure 13. Format of a class stanza




    Examples: Class stanzas
          Any of the following class stanza examples may apply to your situation.
          v Example 1: Creating a class that excludes certain users
            class_a: type=class                     # class that excludes users
            priority=10                             # ClassSysprio
            exclude_users=green judy                # Excluded users
          v Example 2: Creating a class for small-size jobs
            small: type=class                                   #   class for small jobs
            priority=80                                         #   ClassSysprio (max=100)
            cpu_limit=00:02:00                                  #   2 minute limit
            data_limit=30mb                                     #   max 30 MB data segment
default_resources=ConsumableVirtualMemory(10mb)     #   resources consumed by each
            ConsumableCpus(1) resA(3) floatinglicenseX(1)       #   task of a small job step if
                                                                #   resources are not explicitly
                                                                #   specified in the job command file
            ckpt_time_limit=3:00,2:00                           #   3 minute hardlimit,
                                                                #   2 minute softlimit
            core_limit=10mb                                     #   max 10 MB core file
file_limit=50mb                                     #   max file size 50 MB
stack_limit=10mb                                    #   max stack size 10 MB
                           rss_limit=35mb                                # max resident set size 35 MB
                           include_users = bob sally                     # authorized users
                        v Example 3: Creating a class for medium-size jobs
                           medium: type=class            #   class for medium jobs
                           priority=70                   #   ClassSysprio
                           cpu_limit=00:10:00            #   10 minute run time limit
                           data_limit=80mb,60mb          #   max 80 MB data segment
                                                         #   soft limit 60 MB data segment
                           ckpt_time_limit=5:00,4:30     #   5 minute hardlimit,
                                                         #   4 minute 30 second softlimit to checkpoint
                           core_limit=30mb               #   max 30 MB core file
                           file_limit=80mb               #   max file size 80 MB
                           stack_limit=30mb              #   max stack size 30 MB
                           rss_limit=100mb               #   max resident set size 100 MB
                           job_cpu_limit=1800,1200       #   hard limit is 30 minutes,
                                                         #   soft limit is 20 minutes
                        v Example 4: Creating a class for large-size jobs
                           large: type=class              # class for large jobs
                           priority=60                    # ClassSysprio
                           cpu_limit=00:10:00             # 10 minute run time limit
                           data_limit=120mb               # max 120 MB data segment
                           default_resources=ConsumableVirtualMemory(40mb)          # resources consumed
                           ConsumableCpus(2) resA(8) floatinglicenseX(1) resB(1)    # by each task of
                                                          # a large job step if resources are not
                                                          # explicitly specified in the job command file
                           ckpt_time_limit=7:00,5:00      # 7 minute hardlimit,
                                                          # 5 minute softlimit to checkpoint
                           core_limit=30mb                # max 30 MB core file
                           file_limit=120mb               # max file size 120 MB
                           stack_limit=unlimited          # unlimited stack size
                           rss_limit=150mb                # max resident set size 150 MB
                           job_cpu_limit = 3600,2700      # hard limit 60 minutes
                                                          # soft limit 45 minutes
                           wall_clock_limit=12:00:00,11:59:55 # hard limit is 12 hours
                        v Example 5: Creating a class for master node machines
                           sp-6hr-sp: type=class          #   class for master node machines
                           priority=50                    #   ClassSysprio (max=100)
                           ckpt_time_limit=25:00,20:00    #   25 minute hardlimit,
                                                          #   20 minute softlimit to checkpoint
                           cpu_limit = 06:00:00           #   6 hour limit
                           job_cpu_limit = 06:00:00       #   hard limit is 6 hours
core_limit = 1mb               #   max 1 MB core file
                           master_node_requirement = true #   master node definition
                        v Example 6: Creating a class for MPICH-GM jobs
                           MPICHGM: type=class            # class for MPICH-GM jobs
                           default_resources = gmports(1) # one gmports resource is consumed by each
                                                          # task, if resources are not explicitly
                                                          # specified in the job command file


Defining user substanzas in class stanzas
In a class stanza, you may define user substanzas using the same syntax as you
would for any stanza in the LoadLeveler administration file.

                        A user substanza within a class stanza defines policies that apply to job steps
                        submitted by that user and belonging to that class. User substanzas are optional
                        and are independent of user stanzas (for information about user stanzas, see
                        “Defining users” on page 97).




Class stanzas that contain user substanzas have the following format:

     label: {
            type = class
            label: {
                  type = user
                  maxidle = number
                  maxjobs = number
                  maxqueued = number
                  max_total_tasks = number
             }
     }

     Figure 14. Format of a user substanza

     When defining substanzas within other stanzas, you must use opening and closing
     braces ({ and }) to mark the beginning and the end of the stanza and substanza.
     The only keywords that are supported in a user substanza are type (required),
     maxidle, maxjobs, maxqueued, and max_total_tasks. For detailed descriptions of
     these keywords, see “Administration file keyword descriptions” on page 327.

Examples: Substanzas
     Any of these substanza examples may apply to your situation.

In the following example, the default machine and class stanzas do not require
braces, but the parallel class stanza does require them. Without braces to
open and close the parallel stanza, it would not be clear that the default and
dept_head user substanzas belong to the parallel class:
     default:
           type = machine
           central_manager = false
           schedd_host = true

     default:
           type = class
           wall_clock_limit = 60:00,30:00

     parallel: {
           type = class

           # Allow at most 50 running jobs for class parallel
           maxjobs = 50

           # Allow at most 10 running jobs for any single
           # user of class parallel
           default: {
                 type = user
                 maxjobs = 10

           }

           # Allow user dept_head to run as many as 20 jobs
           # of class parallel
           dept_head: {type = user
                 maxjobs = 20

           }
     }

     dept_head: type = user
           maxjobs = 30




When user substanzas are used in class stanzas, a default user substanza can be
                        defined. Each class stanza can have its own default user substanza, and even the
                        default class stanza can have a default user substanza. In this example, the default
                        user substanza in the default class indicates that for any combination of class and
                        user, the limits maxidle=20 and maxqueued=30 apply, and that maxjobs and
                        max_total_tasks are unlimited. Some of these values are overridden in the physics
                        class stanza. Here is an example of how class stanzas can be configured:
                        default: {
                              type = class
                              default: {
                                    type = user
                                    maxidle = 20
                                    maxqueued = 30
                                    maxjobs = -1
                                    max_total_tasks = -1
                              }
                        }
                        physics: {
                              type = class
                              default: {
                                    type = user
                                    maxjobs = 10
                                    max_total_tasks = 128
                              }
                              john: {
                                    type = user
                                    maxidle = 10
                                    maxjobs = 14
                              }
                              jane: {
                                    type = user
                                    max_total_tasks = 192
                              }
                        }

                        In the following example, the physics stanza shows which values are inherited
                        from which stanzas:
                        physics: {
                                 type = class
                                 default: {
                                       type = user
                                       # inherited from default class, default user
                                       # maxidle = 20

                                         # inherited from default class, default user
                                         # maxqueued = 30

                                         # overrides value of -1 in default class, default user
                                         maxjobs = 10

                                         # overrides value of -1 in default class, default user
                                         max_total_tasks = 128
                                  }
                                  john: {
                                        type = user
                                        # overrides value of 10 in default user
                                        maxidle = 10

                                         # inherited from default user, which was inherited
                                         # from default class, default user
                                         # maxqueued = 30

                                         # overrides value of 10 in default user
                                         maxjobs = 14


# inherited from default user
                                # max_total_tasks = 128
                         }

                         jane: {
                               type = user
                               # inherited from default user, which was inherited
                               # from default class, default user
                               # maxidle = 20

                               # inherited from default user, which was inherited
                               # from default class, default user
                               # maxqueued = 30

                               # inherited from default user
                               # maxjobs = 10

                               # overrides value of 128 in default user
                               max_total_tasks = 192
                         }
                 }

                 Any user other than john and jane who submits jobs of class physics is subject to
                 the constraints in the default user substanza in the physics class stanza. Should
                 john or jane submit jobs of any class other than physics, they are subject to the
                 constraints in the default user substanza in the default class stanza.

                 In addition to specifying a default user substanza within the default class stanza,
                 an administrator can specify other user substanzas in the default class stanza. It is
                 important to note that all class stanzas will inherit all user substanzas from the
                 default class stanza.

                 Note: An important rule to understand is that a user substanza within a class
                       stanza will inherit its values from the user substanza in the default class
                       stanza first, if a substanza for that user is present. The next location a user
                       substanza inherits values from is the default user substanza within the same
                       class stanza.

                 When no default stanzas or substanzas are provided, the LoadLeveler default for
                 all four keywords is -1 or unlimited.

                 If a user substanza is provided for a user on the class exclude_users list,
                 exclude_users takes precedence and the user substanza will be effectively ignored
                 because that user cannot use the class at all. On the other hand, when
                 include_users is used in a class, the presence of a user substanza implies that the
                 user is permitted to use the class (it is as if the user were present on the
                 include_users list).
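
A brief sketch of the exclude_users interaction (the names are hypothetical):

   physics: {
         type = class
         exclude_users = ted

         # Ignored: ted is on the exclude_users list, so he cannot
         # use class physics at all
         ted: {
               type = user
               maxjobs = 5
         }
   }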

Defining users
                 The information specified in a user stanza defines the characteristics of that user.
                 You can have one user stanza for each user but this is not necessary. If an
                 individual user does not have their own user stanza, that user uses the defaults
                 defined in the default user stanza.

        User stanza format and keyword summary
                 User stanzas take a particular format.


User stanzas take the following format:

                            label: type = user
                            account = list
                            default_class = list
                            default_group = group name
                            default_interactive_class = class name
                            env_copy = all | master
                            fair_shares = number
                            max_node = number
                            max_reservation_duration = number
|                           max_reservation_expiration = number
                            max_reservations = number
                            max_total_tasks = number
                            maxidle = number
                            maxjobs = number
                            maxqueued = number
                            priority = number
                            total_tasks = number

                            Figure 15. Format of a user stanza

                            For more information about the keywords listed in the user stanza format, see
                            Chapter 13, “Administration file reference,” on page 321.

                 Examples: User stanzas
                            Any of the following user stanzas may apply to your situation.
                            v Example 1
                              In this example, user fred is being provided with a user stanza. User fred’s jobs
                              will have a user priority of 100. If user fred does not specify a job class in the
                              job command file, the default job class class_a will be used. In addition, he can
                              have a maximum of 15 jobs running at the same time.
                               # Define user stanzas
                               fred: type = user
                               priority = 100
                               default_class = class_a
                               maxjobs = 15
                            v Example 2
                              This example explains how a default interactive class for a parallel job is set by
                              presenting a series of user stanzas and class stanzas. This example assumes that
                              users do not specify the LOADL_INTERACTIVE_CLASS environment variable.
default: type = user
                                        default_interactive_class = red
                                        default_class = blue

                               carol:    type = user
                                         default_class = single double
                                         default_interactive_class = ijobs

                               steve:    type = user
                                         default_class = single double

                               ijobs:    type = class
                                         wall_clock_limit = 08:00:00

                               red:      type = class
                                         wall_clock_limit = 30:00
                               If the user Carol submits an interactive job, the job is assigned to the default
interactive class called ijobs. The job is assigned a wall clock limit of 8 hours. If
the user Steve submits an interactive job, the job is assigned to the red class
                    from the default user stanza. The job is assigned a wall clock limit of 30
                    minutes.
                  v Example 3
                    In this example, Jane’s jobs have a user priority of 50, and if she does not specify
                    a job class in her job command file the default job class small_jobs is used. This
                    user stanza does not specify the maximum number of jobs that Jane can run at
                    the same time so this value defaults to the value defined in the default stanza.
                    Also, suppose Jane is a member of the primary UNIX group “staff.” Jobs
                    submitted by Jane will use the default LoadLeveler group “staff.” Lastly, Jane
                    can use three different account numbers.
                      # Define user stanzas
                      jane: type = user
                      priority = 50
                      default_class = small_jobs
                      default_group = Unix_Group
                      account = dept10 user3 user4


    Defining groups
                  LoadLeveler groups are another way of granting control to the system
                  administrator.

                  Although a LoadLeveler group is independent from a UNIX group, you can
                  configure a LoadLeveler group to have the same users as a UNIX group by using
                  the include_users keyword.

           Group stanza format and keyword summary
                  The information specified in a group stanza defines the characteristics of that
                  group.

                  Group stanzas are optional and take the following format:

                  label: type = group
                  admin = list
                  env_copy = all | master
                  fair_shares = number
                  exclude_users = list
                  include_users = list
                  max_node = number
                  max_reservation_duration = number
|                 max_reservation_expiration = number
                  max_reservations = number
                  max_total_tasks = number
                  maxidle = number
                  maxjobs = number
                  maxqueued = number
                  priority = number
                  total_tasks = number

                  Figure 16. Format of a group stanza

                  For more information about the keywords listed in the group stanza format, see
                  Chapter 13, “Administration file reference,” on page 321.

           Examples: Group stanzas
                  Any of the following group stanzas may apply to your situation.
                  v Example 1

In this example, the group name is department_a. The jobs issued by users
                               belonging to this group will have a priority of 80. There are three members in
                               this group.
                               # Define group stanzas
                               department_a: type = group
                               priority = 80
                               include_users = susann holly fran
v Example 2
  In this example, the group called great_lakes has five members, and these
  users' jobs have a priority of 100:
                               # Define group stanzas
                               great_lakes: type = group
                               priority = 100
                               include_users = huron ontario michigan erie superior


    Defining clusters
                            The cluster stanza defines the LoadLeveler multicluster environment.

                            Any cluster that wants to participate in the multicluster must have cluster stanzas
                            defined for all clusters with which the local cluster interacts. If you have a cluster
                            stanza defined, LoadLeveler is configured to be in the multicluster environment.

                 Cluster stanza format and keyword summary
                            Cluster stanzas are optional.

                            Cluster stanzas take the following format. Default values for keywords appear in
                            bold.

                            The cluster stanza label must define a unique cluster name within the multicluster
                            environment.


                            label: type = cluster
|                           allow_scale_across_jobs = true | false
                            exclude_classes = class_name[(cluster_name)] ...
                            exclude_groups = group_name[(cluster_name)] ...
                            exclude_users = user_name[(cluster_name)] ...
                            inbound_hosts = hostname[(cluster_name)] ...
                            inbound_schedd_port = port_number
                            include_classes = class_name[(cluster_name)] ...
                            include_groups = group_name[(cluster_name)] ...
                            include_users = user_name[(clustername)] ...
                            local = true | false
|                           main_scale_across_cluster = true | false
                            multicluster_security = SSL
                            outbound_hosts = hostname[(cluster_name)] ...
                            secure_schedd_port = port_number
                            ssl_cipher_list = cipher_list

                            Figure 17. Format of a cluster stanza

                 Examples: Cluster stanzas
                            Any of the following cluster stanzas may apply to your situation.




Figure 18. Multicluster Example: cluster1 (machines M1 and M2, with
SCHEDD_STREAM_PORT = 1966), cluster2 (machines M3, M4, and M5), and cluster3
(machines M6 and M7)

Figure 18 shows a simple multicluster with three clusters defined as members.
Cluster1 has defined an alternate port number for the Schedds running in its
cluster by setting the SCHEDD_STREAM_PORT = 1966. All of the other clusters need to
define what port to use when connecting to the inbound Schedds of cluster1 by
specifying the inbound_schedd_port = 1966 keyword in the cluster1 stanza.
Cluster2 has a single machine connected to cluster1 and two machines connected to
cluster3. Cluster3 has a single machine connected to both cluster2 and cluster1.
Each cluster would set the local keyword to true for their cluster stanza in the
cluster’s administration file.

The cluster stanzas for this multicluster, with three clusters defined as members, are:
cluster1: type=cluster
          outbound_hosts = M2(cluster2) M1(cluster3)
          inbound_hosts = M2(cluster2) M1(cluster3)
          inbound_schedd_port = 1966

cluster2: type=cluster
          outbound_hosts = M3(cluster1) M4(cluster3)
          inbound_hosts = M3(cluster1) M4(cluster3) M5(cluster3)


cluster3: type=cluster
          outbound_hosts = M6
          inbound_hosts = M6




Chapter 6. Performing additional administrator tasks
There are additional ways to modify the LoadLeveler environment that require
administrator intervention.

                 Table 22 lists additional ways to modify the LoadLeveler environment that either
                 require an administrator to customize both the configuration and administration
                 files, or require the use of the LoadLeveler commands or APIs.
                 Table 22. Roadmap of additional administrator tasks
                 To learn about:                       Read the following:
                 Setting up the environment for        “Setting up the environment for parallel jobs” on page
                 parallel jobs                         104
                 Configuring and using an              v “Using the BACKFILL scheduler” on page 110
                 alternative scheduler
                                                       v “Using an external scheduler” on page 115
                                                       v “Example: Changing scheduler types” on page 126
|                Using additional features available   v “Preempting and resuming jobs” on page 126
|                with the BACKFILL scheduler
                                                       v “Configuring LoadLeveler to support reservations”
                                                         on page 131
|                                                      v “Working with reservations” on page 213
|                                                      v “Data staging” on page 113
                 Working with AIX’s workload           “Steps for integrating LoadLeveler with the AIX
                 balancing component                   Workload Manager” on page 137
                 Enabling LoadLeveler’s                “LoadLeveler support for checkpointing jobs” on page
                 checkpoint/restart function           139
                 Enabling LoadLeveler’s affinity       v LoadLeveler scheduling affinity (see “LoadLeveler
                 support                                 scheduling affinity support” on page 146)
                 Enabling LoadLeveler’s                v “LoadLeveler multicluster support” on page 148
                 multicluster support
                                                       v “Configuring a LoadLeveler multicluster” on page
                                                         150
|                                                      v “Scale-across scheduling with multiclusters” on page
|                                                        153
                 Enabling LoadLeveler’s Blue Gene      v “LoadLeveler Blue Gene support” on page 155
                 support
                                                       v “Configuring LoadLeveler Blue Gene support” on
                                                         page 157
                 Enabling LoadLeveler’s fair share     v “Fair share scheduling overview” on page 27
                 scheduling support
                                                       v “Using fair share scheduling” on page 160
                 Moving job records from a down        v “Procedure for recovering a job spool” on page 167
                 Schedd to another Schedd within
                                                       v “llmovespool - Move job records” on page 472
                 the local cluster
                 Correctly specifying configuration    v Chapter 12, “Configuration file reference,” on page
                 and administration file keywords        263
                                                       v Chapter 13, “Administration file reference,” on page
                                                         321
                 Managing LoadLeveler operations




                            v Querying status                   v “llclass - Query class information” on page 433
                                                                v “llq - Query job status” on page 479
                                                                v “llqres - Query a reservation” on page 500
                                                                v “llstatus - Query machine status” on page 512

                            v Changing attributes of submitted v “llfavorjob - Reorder system queue by job” on page
                              jobs                               447
                                                                v “llfavoruser - Reorder system queue by user” on
                                                                  page 449
                                                                v “llmodify - Change attributes of a submitted job
                                                                  step” on page 464
                                                                v “llprio - Change the user priority of submitted job
                                                                  steps” on page 477

                            v Changing the state of submitted   v “llcancel - Cancel a submitted job” on page 421
                              jobs                              v “llhold - Hold or release a submitted job” on page
                                                                  454



    Setting up the environment for parallel jobs
                            Additional administration tasks apply to parallel jobs.

                            This topic describes the following administration tasks that apply to parallel jobs:
                            v Scheduling support
                            v Reducing job launch overhead
                            v Submitting interactive POE jobs
                            v Setting up a class
                            v Setting up a parallel master node
                            v Configuring MPICH jobs
                            v Configuring MVAPICH jobs
                            v Configuring MPICH-GM jobs

                            For information on submitting parallel jobs, see “Working with parallel jobs” on
                            page 194.

                 Scheduling considerations for parallel jobs
|                           For parallel jobs, LoadLeveler supports BACKFILL scheduling for efficient use of
|                           system resources.

                            This scheduler runs both serial and parallel jobs.

                            BACKFILL scheduling also supports:
                            v Multiple tasks per node
                            v Multiple user space tasks per adapter
                            v Preemption

                            Specify the LoadLeveler scheduler using the SCHEDULER_TYPE keyword. For
                            more information on this keyword and supported scheduler types, see “Choosing a
                            scheduler” on page 44.
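
For example, to select the BACKFILL scheduler in the configuration file:

   SCHEDULER_TYPE = BACKFILL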

Steps for reducing job launch overhead for parallel jobs
      Administrators may define a number of LoadLeveler starter processes to be ready
      and waiting to handle job requests.

      Having this pool of ready processes reduces the amount of time LoadLeveler needs
      to prepare jobs to run. You also may control how environment variables are copied
      for a job. Reducing the number of environment variables that LoadLeveler has to
      copy reduces the amount of time LoadLeveler needs to prepare jobs to run.

      Before you begin: You need to know:
      v How many jobs might be starting at the same time. This estimate determines
        how many starter processes to have LoadLeveler start in advance, to be ready
        and waiting for job requests.
      v The type of parallel jobs that typically are used. If IBM Parallel Environment
        (PE) is used for parallel jobs, PE copies the user’s environment to all executing
        nodes. In this case, you may configure LoadLeveler to avoid redundantly
        copying the same environment variables.
      v How to correctly specify configuration keywords. For details about specific
        keyword syntax and use:
        – In the administration file, see Chapter 13, “Administration file reference,” on
           page 321.
        – In the configuration file, see Chapter 12, “Configuration file reference,” on
           page 263.

      Perform the following steps to configure LoadLeveler to reduce job launch
      overhead for parallel jobs.
      1. In the local or global configuration file, specify the number of starter processes
         for LoadLeveler to automatically start before job requests are submitted. Use
         the PRESTARTED_STARTERS keyword to set this value.
         Tip: The default value of 1 should be sufficient for most installations.
      2. If typical parallel jobs use a facility such as Parallel Environment, which copies
         user environment variables to all executing nodes, set the env_copy keyword in
         the class, user, or group stanzas to specify that LoadLeveler only copy user
         environment variables to the master node by default.
         Rules:
         v Users also may set this keyword in the job command file. If the env_copy
            keyword is set in the job command file, that setting overrides any setting in
            the administration file. For more information, see “Step for controlling
            whether LoadLeveler copies environment variables to all executing nodes”
            on page 195.
         v If the env_copy keyword is set in more than one stanza in the administration
            file, LoadLeveler determines the setting to use by examining all values set in
         the applicable stanzas. See the table in the env_copy administration file
         keyword description to determine what value LoadLeveler will use.
      3. Notify LoadLeveler daemons by issuing the llctl command with either the
         reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
         modifications you made to the configuration and administration files.

      When you are done with this procedure, you can use the POE stderr and
      LoadLeveler logs to trace actions during job launch.




Steps for allowing users to submit interactive POE jobs
                        You can set up your system so that users can submit interactive POE jobs to
                        LoadLeveler.

                        Perform the following steps to set up your system so that users can submit
                        interactive POE jobs to LoadLeveler.
                        1. Make sure that you have installed LoadLeveler and defined LoadLeveler
                            administrators. See “Defining LoadLeveler administrators” on page 43 for
                            information on defining LoadLeveler administrators.
                        2. If you are running user space jobs, LoadLeveler must be configured to use
                            switch adapters. One way to do this is to run the llextRPD command to extract node
                            and adapter information from the RSCT peer domain. See “llextRPD - Extract
                            data from an RSCT peer domain” on page 443 for additional information.
                        3. In the configuration file, define your scheduler to be the LoadLeveler
                            BACKFILL scheduler by specifying SCHEDULER_TYPE = BACKFILL. See
                            “Choosing a scheduler” on page 44 for more information.
                        4. In the administration file, specify batch, interactive, or general use for nodes.
                            You can use the machine_mode keyword in the machine stanza to specify the
                            type of jobs that can run on a node; you must specify either interactive or
                            general if you are going to run interactive jobs.
                        5. In the administration file, configure optional functions, including:
                            v Setting up pools: you can organize nodes into pools by using the pool_list
                               keyword in the machine stanza. See “Defining machines” on page 84 for
                               more information.
                            v Enabling SP™ exclusive use accounting: you can specify that the accounting
                               function on an SP system be informed that a job step has exclusive use of a
                               machine by specifying spacct_exclusive_enable = true in the machine stanza.
                               See “Defining machines” on page 84 for more information on these
                               keywords.
                        6. Consider setting up a class stanza for your interactive POE jobs. See “Setting
                            up a class for parallel jobs” for more information. Define this class to be your
                            default class for interactive jobs by specifying this class name on the
                            default_interactive_class keyword. See “Defining users” on page 97 for more
                            information.
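                        A condensed sketch of the entries these steps produce (the machine, pool, and
                        class names are hypothetical):

                        # Configuration file
                        SCHEDULER_TYPE = BACKFILL

                        # Administration file
                        node01: type = machine
                             machine_mode = general
                             pool_list = 1

                        inter_poe: type = class
                             # characteristics for interactive POE jobs

                        default: type = user
                             default_interactive_class = inter_poe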

             Setting up a class for parallel jobs
                        To define the characteristics of parallel jobs run by your installation, set up
                        a class stanza in the administration file and define a class (in the Class
                        statement in the configuration file) for each task you want to run on a node.

                        Suppose your installation plans to submit long-running parallel jobs, and you want
                        to define the following characteristics:
                        v   Only certain users can submit these jobs
                        v   Jobs have a 30 hour run time limit
                        v   A job can request a maximum of 60 nodes and 120 total tasks
                        v   Jobs will have a relatively low run priority

                        The following is a sample class stanza for long-running parallel jobs which takes
                        into account these characteristics:



long_parallel: type=class
          wall_clock_limit = 108000
          include_users = jack queen king ace
          priority = 50
          total_tasks = 120
          max_node = 60
          maxjobs = 2

          Note the following about this class stanza:
          v The wall_clock_limit keyword sets a wall clock limit of 108000 seconds (30
            hours) for jobs in this class
          v The include_users keyword allows four users to submit jobs in this class
          v The priority keyword sets a relative priority of 50 for jobs in this class
          v The total_tasks keyword specifies that a user can request up to 120 total tasks
            for a job in this class
          v The max_node keyword specifies that a user can request up to 60 nodes for a
            job in this class
          v The maxjobs keyword specifies that a maximum of two jobs in this class can run
            simultaneously

          Suppose users need to submit job command files containing the following
          statements:
          node = 30
          tasks_per_node = 4

          In your LoadL_config file, you must code the Class statement such that at least 30
          nodes have four or more long_parallel classes defined. That is, the configuration
          file for each of these nodes must include the following statement:
          Class = { "long_parallel" "long_parallel" "long_parallel" "long_parallel" }

          or
          Class = long_parallel(4)

          For more information, see “Defining LoadLeveler machine characteristics” on page
          54.

|   Striping when some networks fail
|         When multiple networks are configured in a cluster, a job can request striping over
|         the networks by setting sn_all in the network statement in the job command file.
|         The striping_with_minimum_networks administration file keyword in the class
|         stanza is used to tell LoadLeveler how to select nodes for sn_all jobs of a specific
|         class when one or more networks are unavailable. When
|         striping_with_minimum_networks is set to false for a class, LoadLeveler will only
|         select nodes for sn_all jobs of that class where all the networks are up and in the
|         READY state. When striping_with_minimum_networks is set to true, LoadLeveler
|         will select a set of nodes where at least more than half of the networks on the
|         nodes are up and in the READY state.

|         For example, if there are 8 networks connected to a node and
|         striping_with_minimum_networks is set to false, all 8 networks would have to be
|         up and in the READY state to consider that node for sn_all jobs. If
|         striping_with_minimum_networks is set to true, nodes with at least 5 of the 8
|         networks up and in the READY state would be considered for sn_all jobs.
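          For example, a class that tolerates degraded networks might be defined as
          follows (the class name is hypothetical):

          sn_all_class: type = class
               striping_with_minimum_networks = true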



Setting up a parallel master node
                        LoadLeveler allows you to define a parallel master node that LoadLeveler will use
                        as the first node for a job submitted to a particular class.

                        To set up a parallel master node, code the following keywords in the node’s class
                        and machine stanzas in the administration file:
                        # MACHINE STANZA: (optional)
                        mach1:     type = machine
                        master_node_exclusive = true


                        # CLASS STANZA: (optional)
                        pmv3:      type = class
                        master_node_requirement = true

                        Specifying master_node_requirement = true forces all parallel jobs in this class to
                        use, as their first node, a machine with the master_node_exclusive = true setting.
                        For more information on these keywords, see “Defining machines” on page 84 and
                        “Defining classes” on page 89.

             Configuring LoadLeveler to support MPICH jobs
                        The MPICH package can be configured so that LoadLeveler will be used to spawn
                        all tasks in an MPICH application.

                        Using LoadLeveler to spawn MPICH tasks allows LoadLeveler to accumulate
                        accounting data for the tasks and also allows LoadLeveler to ensure that all tasks
                        are terminated when the job completes.

                        For LoadLeveler to spawn the tasks of an MPICH job, the MPICH package must be
                        configured to use the LoadLeveler llspawn.stdio command when starting tasks. To
                        configure MPICH to use llspawn.stdio, set the environment variable
                        RSHCOMMAND to the location of the llspawn.stdio command and run the
                        configure command for the MPICH package.

                        On Linux systems, enter the following:
                        # export RSHCOMMAND=/opt/ibmll/LoadL/full/bin/llspawn.stdio
                        # ./configure

                        Note: This configuration works on MPICH-1.2.7. Additional documentation for
                              MPICH is available from the Argonne National Laboratory web site at
                              http://www-unix.mcs.anl.gov/mpi/mpich1/.

             Configuring LoadLeveler to support MVAPICH jobs
                        To run MVAPICH jobs under LoadLeveler control, you must specify the llspawn
                        command to replace the default RSHCOMMAND value during software
                        configuration.

                        The compiled MVAPICH implementation code uses the llspawn command to start
                        tasks under LoadLeveler control. This allows LoadLeveler to have total control
                        over the remote tasks for accounting and cleanup.

                        To configure the MVAPICH code to use the llspawn command as
                        RSHCOMMAND, change the mpirun_rsh.c program source code by following
                        these steps before compiling MVAPICH:
                        1. Replace:

void child_handler(int);
          with:
          void child_handler(int);
          void term_handler(int);
       2. For Linux, replace:
          #define RSH_CMD "/usr/bin/rsh"
          #define SSH_CMD "/usr/bin/ssh"
          with:
          #define RSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
          #define SSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
       3. Replace:
          signal(SIGCHLD, child_handler);
          with:
          signal(SIGCHLD, SIG_IGN);
          signal(SIGTERM, term_handler);
       4. Add the definition for the term_handler function at the end:
          void term_handler(int signal)
          {
            exit(0);
          }

Configuring LoadLeveler to support MPICH-GM jobs
      To run MPICH-GM jobs under LoadLeveler control, you need to configure the
      MPICH-GM implementation you are using by specifying the llspawn command as
      RSHCOMMAND.

      The compiled MPICH-GM implementation code uses the llspawn command to
      start tasks under LoadLeveler control. This allows LoadLeveler to have total
      control over the remote tasks for accounting and cleanup.

      To configure the MPICH-GM code to use the llspawn command as
      RSHCOMMAND, change the mpich.make.gcc script before compiling the
      MPICH-GM:

       Replace:
       setenv RSHCOMMAND /usr/bin/rsh

       with:
       setenv RSHCOMMAND /opt/ibmll/LoadL/full/bin/llspawn

       LoadLeveler does not manage the GM ports on the Myrinet switch. For
       LoadLeveler to keep track of the GM ports, they must be identified as
       LoadLeveler consumable resources.

      Perform the following steps to use consumable resources to manage GM ports:
      1. Pick a name for the GM port resource.
         Example: As an example, this procedure assumes the name is gmports, but you
         may use another name.
         Tip: Users who submit MPICH-GM jobs need to know the name that you
         define for the GM port resource.
      2. In the LoadLeveler configuration file, specify the GM port resource name on
         the SCHEDULE_BY_RESOURCES keyword.
         Example:

SCHEDULE_BY_RESOURCES = gmports
                           Tip: If the SCHEDULE_BY_RESOURCES keyword already is specified in the
                           configuration file, you can just add the GM port resource name to other values
                           already listed.
                        3. In the administration file, specify how many GM ports are available on each
                           machine. Use the resources keyword to specify the GM port resource name and
                           the number of GM ports.
                           Example:
                            resources=gmports(n)
                            Tips:
                            v The resources keyword also must appear in the job command file for an
                              MPICH-GM job.
                              Example:
                               resources=gmports(1)
                           v To determine the value of n use either the number specified in the GM
                              documentation or the number of GM ports you have successfully used.
                              Certain system configurations may not support all available GM ports, so
                              you might need to specify a lower value for the gmports resource than what
                              is actually available.
                        4. Issue the llctl command with either the reconfig or recycle keyword.
                           Otherwise, LoadLeveler will not process the modifications you made to the
                           configuration and administration files.
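                        Putting steps 2 and 3 together, a minimal sketch (the machine name and the
                        port count of 4 are hypothetical):

                        # Configuration file
                        SCHEDULE_BY_RESOURCES = gmports

                        # Administration file
                        node01: type = machine
                             resources = gmports(4)

                        # Job command file statement supplied by the user
                        # @ resources = gmports(1)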

                        For information about submitting MPICH-GM jobs, see “Running MPICH,
                        MVAPICH, and MPICH-GM jobs” on page 204.

Using the BACKFILL scheduler
                        The BACKFILL scheduling algorithm in LoadLeveler is designed to maximize the
                        use of resources to achieve the highest system efficiency, while preventing
                        potentially excessive delays in starting jobs with large resource requirements.

                        These large jobs can run because the BACKFILL scheduler does not allow jobs
                        with smaller resource requirements to continuously use up resources before the
                        larger jobs can accumulate enough resources to run. While BACKFILL can be used
                        for both serial and parallel jobs, the potential advantage is greater with parallel
                        jobs.

                        Job steps are arranged in a queue based on their SYSPRIO order as they arrive
                        from the Schedd nodes in the cluster. The queue can be reordered periodically,
                        depending on the value of the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword.
                        In each dispatching cycle, as determined by the NEGOTIATOR_INTERVAL and
                        NEGOTIATOR_CYCLE_DELAY configuration keywords, the BACKFILL algorithm
                        examines these job steps sequentially in an attempt to find available resources to
                        run each job step, then dispatches those steps to run.

                        Once the BACKFILL algorithm encounters a job step for which it cannot
                        immediately find enough resources, that job step becomes known as a "top dog".
                        The BACKFILL algorithm can allocate multiple top dogs in the same dispatch
                        cycle. By using the MAX_TOP_DOGS configuration keyword (for more
                        information, see Chapter 12, “Configuration file reference,” on page 263), you can
                        define the maximum number of top dogs that the central manager will allocate.
                        For each top dog, the BACKFILL algorithm will attempt to calculate the earliest
                        time at which enough resources will become free to run the corresponding top

dog. This is based on the assumption that each currently running job step will run
    until its hard wall clock limit is reached and that when a job step terminates, the
    resources which that step has been using will become available.

     The time at which enough currently running job steps will have terminated,
     meaning enough resources have become available to run a top dog, is called the
     top dog's future start time. The future start time of each top dog is effectively
    guaranteed for the remainder of the execution of the BACKFILL algorithm. The
    resources that each top dog will use at its corresponding start time and for its
    duration, as specified by its hard wall clock limit, are reserved (not to be confused
    with the reservation feature available in LoadLeveler).

    Note: A job that is bound to a reservation is not considered for top-dog
          scheduling, so there is no top-dog scheduling performed inside reservations.

    In some cases, it may not be possible to calculate the future start time of a job step.
    Consider, for example, a case where there are 20 nodes in the cluster and a job step
    requires 24 nodes to run. Even when all nodes in the cluster are idle, it will not be
    possible for this job step to run. Only the addition of nodes to the cluster would
    allow the job step to run, and there is no way the BACKFILL algorithm can make
     any assumptions about when that could take place. In situations like this, the job
     step is not considered a "top dog", no resources are "reserved", and the BACKFILL
     algorithm goes on to the next job step in the queue.

|   The BACKFILL scheduling algorithm classifies job steps into distinct types:
|   REGULAR, TOP DOG, and BACKFILL:
|   v The REGULAR job step is a job step for which enough resources are currently
|     available and no top dogs have yet been allocated.
|   v The TOP DOG job step is a job step for which not enough resources are
|     currently available, but enough resources are available at a future time and one
|     of the following conditions is met:
|     – The TOP DOG job step is not expected to run at a time when any other top
|        dog is expected to run.
|     – If the TOP DOG is expected to run at a time when some other top dogs are
|        expected to run, then it cannot be using resources reserved by such top dogs.
|   v The BACKFILL job step is a job step for which enough resources are currently
|     available and one of the following conditions is met:
|     – The BACKFILL job step is expected to complete before the future start times
|        of all top dogs, based on the hard wall clock limit of the BACKFILL job step.
|     – If the BACKFILL job step is not expected to complete before the future start
|        time of at least one top dog, then it cannot be using resources reserved by the
|        top dogs that are expected to start before BACKFILL job step is expected to
|        complete.

    Table 23 provides a roadmap of BACKFILL scheduler tasks.
    Table 23. Roadmap of BACKFILL scheduler tasks
    Subtask                       Associated instructions (see . . . )
    Configuring the BACKFILL      v “Choosing a scheduler” on page 44
    scheduler
                                  v “Tips for using the BACKFILL scheduler” on page 112
                                  v “Example: BACKFILL scheduling” on page 113




Table 23. Roadmap of BACKFILL scheduler tasks (continued)
                            Subtask                        Associated instructions (see . . . )
                            Using additional LoadLeveler   v “Preempting and resuming jobs” on page 126
                            features available under the
                                                           v “Configuring LoadLeveler to support reservations” on
                            BACKFILL scheduler
                                                             page 131
|                                                          v “Working with reservations” on page 213
|                                                          v “Data staging” on page 113
|                                                          v “Scale-across scheduling with multiclusters” on page 153
                            Use the BACKFILL scheduler     v “llclass - Query class information” on page 433
                            to dispatch and manage jobs
                                                           v “llmodify - Change attributes of a submitted job step” on
                                                             page 464
                                                           v “llpreempt - Preempt a submitted job step” on page 474
                                                           v “llq - Query job status” on page 479
                                                           v “llsubmit - Submit a job” on page 531
                                                           v “Data access API” on page 560
                                                           v “Error handling API” on page 639
                                                           v “ll_modify subroutine” on page 677
                                                           v “ll_preempt subroutine” on page 686



                 Tips for using the BACKFILL scheduler
                            There are a number of essential considerations to make when using the BACKFILL
                            scheduler.

                            Note the following when using the BACKFILL scheduler:
                            v To use this scheduler, either users must set a wall-clock limit in their job
                              command file or the administrator must define a wall-clock limit value for the
                              class to which a job is assigned. Jobs with a wall_clock_limit of unlimited
                              cannot be used to backfill because they might not finish in time.
                            v Using wall clock limits that accurately reflect the actual running time of the job
                              steps will result in a more efficient utilization of resources. When a job step’s
                              wall clock limit is substantially longer than the amount of time the job step
                              actually needs, it results in two inefficiencies in the BACKFILL algorithm:
                              – The future start time of a "top dog" will be calculated to be much later due to
                                 the long wall clock limits of the running job steps, leaving a larger window
                                 for BACKFILL job steps to run. This causes the "top dog" to start later than it
                                 would have if more accurate wall clock limits had been given.
                              – A job step is less likely to be backfilled if its wall clock limit is longer because
                                 it is more likely to run past the future start time of a "top dog".
                            v You should use only the default settings for the START expression and the other
                              job control functions described in “Managing job status through control
                              expressions” on page 68. If you do not use these default settings, jobs will still
                              run but the scheduler will not be as efficient. For example, the scheduler will not
                              be able to guarantee a time at which the highest priority job will run.
                            v You should configure any multiprocessor (SMP) nodes such that the number of
                              jobs that can run on a node (determined by the MAX_STARTERS keyword) is
                              always less than or equal to the number of processors on the node.
                            v Due to the characteristics of the BACKFILL algorithm, in some cases this
                              scheduler may not honor the MACHPRIO statement. For more information on
                              MACHPRIO, see “Setting negotiator characteristics and policies” on page 45.

v When using PREEMPT_CLASS rules it is helpful to create a SYSPRIO
                      expression which is consistent with the preemption rules. This can be done by
                      using the ClassSysprio built-in variable with a multiplier, such as SYSPRIO:
                      (ClassSysprio * 10000) - QDate, as shown in the sketch following this list. If
                      classes which appear on the left-hand side of PREEMPT_CLASS rules are given
                      a higher priority than those which appear on the right, preemption won't be
                      required as often because the job steps which can preempt will be higher in the
                      queue than the job steps which can be preempted.
                    v Entering llq -s for a top-dog step displays that the step is a top dog.
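                    A hypothetical pairing of a preemption rule with a consistent SYSPRIO
                    expression (the class names Night and Day are illustrative):

                    PREEMPT_CLASS[Night] = ALL { Day }
                    SYSPRIO: (ClassSysprio * 10000) - QDate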

            Example: BACKFILL scheduling
                   On a rack with 10 nodes, 8 of the nodes are being used by Job A.

                   Job B has the highest priority in the queue, and requires 10 nodes. Job C has the
                   next highest priority in the queue, and requires only two nodes. Job B has to wait
                   for Job A to finish so that it can use the freed nodes. Because Job A is only using 8
                   of the 10 nodes, the BACKFILL scheduler can schedule Job C (which only needs
                   the two available nodes) to run as long as it finishes before Job A finishes (and Job
                   B starts). To determine whether or not Job C has time to run, the BACKFILL
                   scheduler uses Job C’s wall_clock_limit value to determine whether or not it will
                   finish before Job A ends. If Job C has a wall_clock_limit of unlimited, it may not
                   finish before Job B’s start time, and it won’t be dispatched.
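                    For instance, Job C might have been submitted with a job command file like the
                    following (the two-node size and 20-minute limit are hypothetical):

                    # @ job_type = parallel
                    # @ node = 2
                    # @ wall_clock_limit = 00:20:00
                    # @ queue

                    Because the limit is finite and short, the scheduler can verify that Job C will
                    complete before Job B's future start time.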

|   Data staging
|                  Data staging allows you to stage data needed by a job before the job begins
|                  execution and to move data back to archives when a job has finished execution. A
|                  job can use one inbound data staging step and one outbound data staging step.
|                  The inbound step will be the first to be executed and the outbound step, the last.

|                  LoadLeveler provides data staging for two scenarios:
|                  1. A single replica of the data files needed by a job have to be created on a
|                     common file system.
|                  2. A replica of the data files has to be created on every machine on which the job
|                     will run.

|                  LoadLeveler allows you to request the time at which data staging operations
|                  should be scheduled.
|                  1. A single replica must be created as soon as a job is submitted, regardless of
|                     when the job will be executed. This is the AT_SUBMIT configuration option.
|                  2. A single replica of the data files must be created as close as possible to
|                     execution time of the job. This is the JUST_IN_TIME configuration option.
|                  3. A replica must be created on each machine that the job runs on, as close as
|                     possible to execution time of the job. This is also the JUST_IN_TIME
|                     configuration option.

|                  The basic steps involved in data staging include:
|                  1. A job is submitted that contains data staging keywords.
|                  2. LoadLeveler generates inbound and outbound data staging steps in accordance
|                     with these keywords. All other steps of the job have an implicit dependency on
|                     the completion of the inbound data staging step.
|                  3. Scheduling methods:


|                              a. With the AT_SUBMIT configuration option, the data staging step is started
|                                  first and the application steps are scheduled when its data staging
|                                  dependency is satisfied (that is, when the inbound data staging step is
|                                  completed).
|                              b. With the JUST_IN_TIME configuration option, the first application step of
|                                  the job is scheduled in the future based on the wall clock time specified for
|                                  the inbound data staging step. The inbound data staging step is started on
|                                  the machines that will be used by the first application step.
|                           4. When the inbound data staging step completes, all of the application job steps
|                              become eligible for scheduling. The exit code from the inbound data staging
|                              program is made available to all application job steps in the
|                              LL_DSTG_IN_EXIT_CODE environment variable.
|                           5. When all the application job steps are completed, the outbound data staging
|                              step is started by LoadLeveler. Typically, the outbound data staging step would
|                              be used to move data files back to their archives.

|                           Note: You cannot preempt data staging steps using the llpreempt command or by
|                                 specifying the data_stage class in system preemption rules. Similarly, a step
|                                 belonging to the data_stage class cannot preempt any other job step.
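                            A sketch of the data staging portion of a job command file (the script paths
                            and time limits are hypothetical, and the dstg_in_script and dstg_out_script
                            keyword names are assumed; see the job command file keyword reference for the
                            full set of dstg_* keywords):

                            # @ dstg_in_script = /u/user1/stage_in.sh
                            # @ dstg_in_wall_clock_limit = 00:10:00
                            # @ dstg_out_script = /u/user1/stage_out.sh
                            # @ dstg_out_wall_clock_limit = 00:10:00
                            # @ executable = /u/user1/myjob
                            # @ queue

                            LoadLeveler generates the inbound and outbound data staging steps from these
                            keywords; the user never names the data_stage class directly.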

|                Configuring LoadLeveler to support data staging
|                           LoadLeveler allows you to specify the execution time for data staging job steps
|                           using the DSTG_TIME keyword. It defaults to the AT_SUBMIT value. To
|                           schedule data staging operation as close to the application as possible, the
|                           JUST_IN_TIME value can be used. DSTG_MIN_SCHEDULING_INTERVAL is a
|                           keyword used to optimize scheduler performance by allowing data staging jobs to
|                           be scheduled only at specific intervals.

|                           A special set of data staging step initiators, called DSTG_MAX_STARTERS, can be
|                           set up for data staging job steps. These initiators will be a distinct set of resources
|                           on the compute node, not included in the MAX_STARTERS set up for compute
|                           jobs. You cannot specify the built-in data_stage class in:
|                           v The CLASS keyword of a job command file
|                           v The default_class keyword in the administration file

|                           For more information about the data staging keywords, see “Configuration file
|                           keyword descriptions” on page 265.

|                           The LoadLeveler administration class stanza keywords can be used to specify
|                           defaults, limits, and restrictions for the built-in data_stage class. The data_stage
|                           class cannot be specified as the default class for a user. You cannot specify the
|                           data_stage class in your job command file. Steps of this class will be automatically
|                           generated by LoadLeveler based on the data staging keywords used in job
|                           command files.

|                           LoadLeveler provides a built-in class called data_stage that can be configured in
|                           the administration file using a class stanza, just as you would do for any other
|                           class. Some examples of how you might use a stanza for the data_stage class are:
|                           v Include and exclude users and groups from this class to control which users are
|                              permitted to use data staging.
|                           v Specifying defaults for resource limits such as cpu_limit or nofile_limit for data
|                              staging steps.


|                 v Specifying defaults and maximum allowed values for the dstg_resources job
|                   command file keyword using default_resources and max_resources.
|                 v Limiting the total number of data staging jobs or tasks in the cluster at any one
|                   time using maxjobs or max_total_tasks.

|                 For more information about the data staging keywords, see “Administration file
|                 keyword descriptions” on page 327.

|                 If an inbound data staging job step is soft-bound to a reservation and keyword
|                 dstg_node=any, it can be started ahead of the reservation start time, if data staging
|                 resources are available. In all other cases, data staging steps will run within the
|                 reservation itself.
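                  A configuration sketch combining these keywords (the values are illustrative):

                  # Configuration file
                  DSTG_TIME = JUST_IN_TIME
                  DSTG_MAX_STARTERS = 2
                  DSTG_MIN_SCHEDULING_INTERVAL = 300

                  # Administration file: control who may stage data and how much runs at once
                  data_stage: type = class
                       include_users = jack queen
                       maxjobs = 10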

    Using an external scheduler
                  The LoadLeveler API provides interfaces that allow an external scheduler to
                  manage the assignment of resources to jobs and the dispatching of those jobs.

                  The primary interfaces for the tasks of an external scheduler are:
                  v ll_query to obtain information about the LoadLeveler cluster, the machines of
                    the cluster, jobs and AIX Workload Manager.
                  v ll_get_data to obtain information about specific objects such as jobs, machines
                    and adapters.
|                 v ll_start_job_ext to start a LoadLeveler job.
|                   – The ll_start_job_ext subroutine supports both serial and parallel jobs. For
|                      parallel jobs, ll_start_job_ext provides the ability to specify which adapters
|                      are used by the communication protocols of each job task. This assures that
|                      each task uses the same network for communication over a given protocol.

                  The steps for dispatching jobs with an external scheduler are:
                  1. Gather information about the LoadLeveler cluster ( ll_query(CLUSTER) ).
                  2. Gather information about the machines in the LoadLeveler cluster (
                     ll_query(MACHINES) ).
                  3. Gather information about the jobs in the cluster ( ll_query(JOBS) ).
                  4. Determine the resources that are currently free. (See the note that follows.)
                  5. Determine which jobs to start. Assign resources to jobs to be started and
                     dispatch ( ll_start_job_ext(LL_start_job_info_ext*) ).
                  6. Repeat steps 1 through 5.

                  When an external scheduler is used, the LoadLeveler Negotiator does not keep
                  track of the resources used by jobs started by the external scheduler. There are two
                  ways that an external scheduler can keep track of the free resources available for
                  starting new jobs. The method that should be used depends on whether the
                  external scheduler runs continuously while all scheduling is occurring or is
                  executed to start a finite number of jobs and then terminates:
                  v If the external scheduler runs continuously, it should query the total resources
                     available in the LoadLeveler system with ll_query and ll_get_data. Then it can
                      keep track of the resources assigned to jobs it starts while they are running and
                     return the resources to the available pool when the jobs complete.
                  v If the external scheduler is executed to start a finite number of jobs and then
                     terminates, it must determine the pool of available resources when it first starts.
                     It can do this by first querying the total resources in the LoadLeveler system
                     using ll_query and ll_get_data. Then it would query the jobs in the system

(again using ll_query), looking for jobs that are running. For each running job, it
                               would remove the resources used by the job from the available pool. After all
                               the running jobs are processed, the available pool would indicate the amount of
                               free resource for starting new jobs.
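                   As a rough sketch of the continuously running case, one scheduling cycle might
                   look like the following. Error handling is omitted, the llapi.h header name is
                   assumed, and the LL_start_job_info_ext fields are only indicated in comments;
                   see "Assigning resources and dispatching jobs" for the complete structure.

                    #include <stdio.h>
                    #include "llapi.h"

                    /* One pass of the external scheduling loop (steps 1 through 5). */
                    void scheduling_cycle(void)
                    {
                      LL_element *query_elem, *job;
                      int job_count, rc;

                      /* Step 3: gather information about the jobs in the cluster */
                      query_elem = ll_query(JOBS);
                      ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
                      job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);

                      while (job != NULL)
                        {
                          /* Steps 4 and 5: decide whether the resources this job
                           * needs are free; if so, fill in an LL_start_job_info_ext
                           * structure (step ID, node list, adapter usage) and start
                           * the job with ll_start_job_ext(), then deduct the
                           * assigned resources from the available pool.
                           */
                          job = ll_next_obj(query_elem);
                        }

                      ll_free_objs(query_elem);
                      ll_deallocate(query_elem);
                    }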

                            To find out more about dispatching jobs with an external scheduler, use the
                            information in Table 24.
                            Table 24. Roadmap of tasks for using an external scheduler
                            Subtask                                    Associated instructions (see . . . )
                            Learn about the LoadLeveler functions      “Replacing the default LoadLeveler scheduling
                            that are limited or not available when     algorithm with an external scheduler”
                            you use an external scheduler
                            Prepare the LoadLeveler environment        “Customizing the configuration file to define an
                            for using an external scheduler            external scheduler” on page 118
                            Use an external scheduler to dispatch      v “Steps for getting information about the
                            jobs                                         LoadLeveler cluster, its machines, and jobs” on
                                                                         page 118
                                                                       v “Assigning resources and dispatching jobs” on
                                                                         page 122



                 Replacing the default LoadLeveler scheduling algorithm with
                 an external scheduler
                            It is important to know how LoadLeveler keywords and commands behave when
                            you replace the default LoadLeveler scheduling algorithm with an external
                            scheduler.

                            LoadLeveler scheduling keywords and commands fall into the following
                            categories:
                            v Keywords not involved in scheduling decisions are unchanged.
|                           v Keywords kept in the job object or in the machine which are used by the
|                             LoadLeveler default scheduler have their values maintained as before and
|                             passed to the data access API.
                            v Keywords used only by the LoadLeveler default scheduler have no effect.

                            Table 25 discusses specific keywords and commands and how they behave when
                            you disable the default LoadLeveler scheduling algorithm.
                            Table 25. Effect of LoadLeveler keywords under an external scheduler
                            Keyword type / name                      Notes
                            Job command file keywords
|                           class                                    This value is provided by the data access API.
|                                                                    Machines chosen by ll_start_job_ext must have the
|                                                                    class of the job available or the request will be
|                                                                    rejected.
|                           dependency                               Supported as before. Job objects for which
|                                                                    dependency cannot be evaluated (because a previous
|                                                                    step has not run) are maintained in the NotQueued
|                                                                    state, and attempts to start them using
|                                                                    ll_start_job_ext will result in an error. If the
|                                                                    dependency is met, ll_start_job_ext can start the
|                                                                    proc.


Table 25. Effect of LoadLeveler keywords under an external scheduler (continued)
    Keyword type / name                   Notes
|   hold                                  ll_start_job_ext cannot start a job that is in Hold
|                                         status.
|   preferences                           Passed to the data access API.
|   requirements                          ll_start_job_ext returns an error if the specified
|                                         machines do not match the requirements of the job.
|                                         This includes Disk and Virtual Memory
|                                         requirements.
|   startdate                             The job remains in the Deferred state until the
|                                         startdate specified in the job is reached.
|                                         ll_start_job_ext cannot start a job in the Deferred
|                                         state.
|   user_priority                         Used in calculating the system priority (as described
|                                         in “Setting and changing the priority of a job” on
|                                         page 230). The system priority assigned to the job is
|                                         available through the data access API. No other
|                                         control of the order in which jobs are run is
|                                         enforced.
    Administration file keywords
    master_node_exclusive                 Ignored
    master_node_requirement               Ignored
    max_jobs_scheduled                    Ignored
    max_reservations                      Ignored
    max_reservation_duration              Ignored
    max_total_tasks                       Ignored
    maxidle                               Supported
    maxjobs                               Ignored
    maxqueued                             Supported
    priority                              Used to calculate the system priority (where
                                          appropriate).
|   speed                                 Available through the data access API.
    Configuration file keywords
    MACHPRIO                              Calculated but is not used.
|   MAX_STARTERS                          Calculated, and if starting the job causes this value
|                                         to be exceeded, ll_start_job_ext returns an error.
|   SYSPRIO                               Calculated and available to the data access API.
    NEGOTIATOR_PARALLEL_DEFER             Ignored
    NEGOTIATOR_PARALLEL_HOLD              Ignored
    NEGOTIATOR_RESCAN_QUEUE               Ignored
    NEGOTIATOR_RECALCULATE_               Works as before. Set this value to 0 if you do not
    SYSPRIO_INTERVAL                      want the system priorities of job objects recalculated.




Customizing the configuration file to define an external
                 scheduler
|                           To use an external scheduler, one of the tasks you must perform is setting the
|                           configuration file keyword SCHEDULER_TYPE to the value API.

                            This keyword option provides a time-based (rather than an event-based) interface.
                            That is, your application must use the data access API to poll LoadLeveler at
                            specific times for machine and job information.

                            When you enable a scheduler type of API, you must specify
                            AGGREGATE_ADAPTERS=NO to make the individual switch adapters available
                            to the external scheduler. This means the external scheduler receives each
                            individual adapter connected to the network, instead of collectively grouping them
                            together. You’ll see each adapter listed individually in the llstatus -l command
                            output. When this keyword is set to YES, the llstatus -l command will show an
                            aggregate adapter which contains information on all switch adapters on the same
                            network. For detailed information about individual switch adapters, issue the
                            llstatus -a command.

                            You also may use the PREEMPTION_SUPPORT keyword, which specifies the
                            level of preemption support for a cluster. Preemption allows for a running job step
                            to be suspended so that another job step can run.
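                            A minimal configuration sketch for running under an external scheduler (the
                            PREEMPTION_SUPPORT value shown is given only as an illustration; see the
                            keyword reference for the allowed settings):

                            SCHEDULER_TYPE = API
                            AGGREGATE_ADAPTERS = NO
                            PREEMPTION_SUPPORT = full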

                 Steps for getting information about the LoadLeveler cluster,
                 its machines, and jobs
                            There are steps to retrieve and use information about the LoadLeveler cluster,
                            machines, jobs and AIX Workload Manager.

                            Perform the following steps to retrieve and use information about the LoadLeveler
                            cluster, machines, jobs and AIX Workload Manager:
                            1. Create a query object for the kind of information you want.
                                Example: To query machine information, code the following instruction:
                                LL_element * query_element = ll_query(MACHINES);
                            2. Customize the query to filter the specific information you want. You can filter
                               the list of objects for which you want information. For some queries, you can
                               also filter how much information you want.
                               Example: The following lines customize the query for just hosts
                               node01.ibm.com and node02.ibm.com and to return the information contained
                               in the llstatus -f command:
                                char * hostlist[] = { "node01.ibm.com","node02.ibm.com",NULL };
                                ll_set_request(query_element,QUERY_HOST,hostlist,STATUS_LINE);
                            3. Once the query has been customized:
                               a. Submit it using ll_get_objs, which returns the first object that matches the
                                  query.
                                b. Interrogate the returned object using the ll_get_data subroutine to retrieve
                                   specific attributes. Depending on the information being queried for, the
                                   query may be directed to a specific node and a specific daemon on that
                                   node.
                                Example: A JOBS query for all data may be directed to the negotiator, Schedd
                                or the history file. If it is directed to the Schedd, you must specify the host of



the Schedd you are interested in. The following demonstrates retrieving the
   name of the first machine returned by the query constructed previously:
    int machine_count;
    int rc;
     LL_element * element = ll_get_objs(query_element,LL_CM,NULL,&machine_count,&rc);
    char * mname;
    ll_get_data(element,LL_MachineName,&mname);

   Because there is only one negotiator in a LoadLeveler cluster, the host does not
   have to be specified. The third parameter is the address of an integer that will
   receive the count of objects returned and the fourth parameter is the address of
   an integer that will receive the completion code of the call. If the call fails,
   NULL is returned and the location pointed to by the fourth parameter is set to
   a reason code. If the call succeeds, the value returned is used as the first
   parameter to a call to ll_get_data. The second parameter to ll_get_data is a
   specification that indicates what attribute of the object is being interrogated.
   The third parameter to ll_get_data is the address of the location into which to
   store the result. ll_get_data returns zero if it is successful and nonzero if an
   error occurs. It is important that the specification (the second parameter to
   ll_get_data) be valid for the object passed in (the first parameter) and that the
   address passed in as the third parameter point to the correct type for the
   specification. Undefined, potentially dangerous behavior will occur if either of
   these conditions is not met.

Example: Retrieving specific information about machines
The following example demonstrates printing out the name and adapter list of all
machines in the LoadLeveler cluster.

The example could be extended to retrieve all of the information available about
the machines in the cluster such as memory, disk space, pool list, features,
supported classes, and architecture, among other things. A similar process would
be used to retrieve information about the cluster overall.
 int i, w, rc;
 int machine_count;
 LL_element * query_elem;
 LL_element * machine;
 LL_element * adapter;
 char * machine_name;
 char * adapter_name;
 int * window_list;
 int window_count;

 /* First we need to obtain a query element which is used to pass     */
 /* parameters in to the machine query                                */
 if ((query_elem = ll_query(MACHINES)) == NULL)
   {
     fprintf(stderr,"Unable to obtain query element\n");
     /* without the query object we will not be able to do anything */
     exit(-1);
   }

 /* Get information relating to machines in the LoadLeveler cluster. */

 /* QUERY_ALL: we are querying all machines                                   */
 /* NULL: since we are querying all machines we do not need to                */
 /*       specify a filter to indicate which machines                         */
 /* ALL_DATA: we want all the information available about the machine         */
 rc=ll_set_request(query_elem,QUERY_ALL,NULL,ALL_DATA);
 if(rc<0)
   {
     /* A real application would map the return code to a message */
     printf("ll_set_request() returned %d\n",rc);
     /* Without customizing the query we cannot proceed */
     exit(rc);
   }

 /* If successful, ll_get_objs() returns the first object that       */
 /* satisfies the criteria that are set in the query element and     */
 /* the parameters. In this case those criteria are:                 */
 /* A machine (from the type of query object)                        */
 /* LL_CM: that the negotiator knows about                           */
 /* NULL: since there is only one negotiator we don't have to        */
 /*       specify which host it is on                                */
 /* The number of machines is returned in machine_count and the      */
 /* return code is returned in rc                                    */
 machine = ll_get_objs(query_elem,LL_CM,NULL,&machine_count,&rc);
 if(rc<0)
   {
     /* A real application would map the return code to a message      */
     printf("ll_get_objs() returned %d\n",rc);

     /* query was not successful -- we cannot proceed but we need to */
     /* release the query element                                    */
     if(ll_deallocate(query_elem) == -1)
       {
         fprintf(stderr,"Attempt to deallocate invalid query element\n");
       }
     exit(rc);
   }

 printf("Number of Machines = %d\n",machine_count);
 i = 0;
 while(machine!=NULL)
   {
     printf("------------------------------------------------------\n");
     printf("Machine %d:\n",i);

     rc = ll_get_data(machine,LL_MachineName,&machine_name);
     if(0==rc)
       {
         printf("Machine name = %s\n",machine_name);
       }
     else
       {
         printf("Error %d occurred retrieving the machine name\n",rc);
       }

     printf("Adapters\n");
     ll_get_data(machine,LL_MachineGetFirstAdapter,&adapter);
     while(adapter != NULL)
       {
         rc = ll_get_data(adapter,LL_AdapterName,&adapter_name);
         if(0!=rc)
           {
             printf("Error %d occurred retrieving the adapter name\n",rc);
           }
         else
           {
             /* Because the list of windows on an adapter is returned */
             /* as an array of integers, we also need to know how big */
             /* the list is. First we query the window count,         */
             /* storing the result in an integer, then we query for   */
             /* the list itself, storing the result in a pointer to   */
             /* an integer. The window list is allocated for us so    */
             /* we need to free it when we are done                   */

             printf("%s windows: ",adapter_name);
             ll_get_data(adapter,LL_AdapterTotalWindowCount,&window_count);
             ll_get_data(adapter,LL_AdapterWindowList,&window_list);
             for (w = 0;w<window_count;w++)
               {
                 printf("%d ",window_list[w]);
               }
             printf("\n");
             free(window_list);
           }
         /* After the first object has been gotten, GetNext returns   */
         /* the next until the list is exhausted                      */
         ll_get_data(machine,LL_MachineGetNextAdapter,&adapter);
       }

     printf("\n");
     i++;
     machine = ll_next_obj(query_elem);
   }

 /* First we need to release the individual objects that were                    */
 /* obtained by the query                                                        */
 if(ll_free_objs(query_elem) == -1)
   {
     fprintf(stderr,"Attempt to free invalid query element\n");
   }

 /* Then we need to release the query itself                          */
 if(ll_deallocate(query_elem) == -1)
   {
     fprintf(stderr,"Attempt to deallocate invalid query element\n");
   }

Example: Retrieving information about jobs
The following example may apply to your situation.

The following example demonstrates retrieving information about jobs up to the
point of starting a job:
 int i, rc;
 int job_count;
 LL_element * query_elem;
 LL_element * job;
 LL_element * step;
 int step_state;

 /* First we need to obtain a query element which is used to pass     */
 /* parameters in to the jobs query                                   */
 if ((query_elem = ll_query(JOBS)) == NULL)
   {
     fprintf(stderr, "Unable to obtain query element\n");
     /* without the query object we will not be able to do anything */
     exit(-1);
   }

 /* Get information relating to Jobs in the LoadLeveler cluster.      */
 printf("Jobs Information ========================================\n\n");
 /* QUERY_ALL: we are querying all jobs                               */
 /* NULL: since we are querying all jobs we do not need to            */
 /*       specify a filter to indicate which jobs                     */
 /* ALL_DATA: we want all the information available about the job     */
 rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
 if (rc < 0)
   {
     /* A real application would map the return code to a message */
     printf("ll_set_request() returned %d\n", rc);
     /* Without customizing the query we cannot proceed */
     exit(rc);
   }

 /* If successful, ll_get_objs() returns the first object that        */
 /* satisfies the criteria that are set in the query element and      */
 /* the parameters. In this case those criteria are:                  */
 /* A job (from the type of query object)                             */
 /* LL_CM: that the negotiator knows about                            */
 /* NULL: since there is only one negotiator we don't have to         */
 /*       specify which host it is on                                 */
 /* The number of jobs is returned in job_count and the               */
 /* return code is returned in rc                                     */
 job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);
 if (rc < 0)
   {
     /* A real application would map the return code to a message */
     printf("ll_get_objs() returned %d\n", rc);

     /* query was not successful -- we cannot proceed but we need to */
     /* release the query element                                    */
     if (ll_deallocate(query_elem) == -1)
       {
         fprintf(stderr, "Attempt to deallocate invalid query element\n");
       }
     exit(rc);
   }

 printf("Number of Jobs = %d\n", job_count);
 step = NULL;
 while (job != NULL)
   {
     /* Each job is composed of one or more steps which are started  */
     /* individually. We need to check the state of the job's steps  */
     ll_get_data(job, LL_JobGetFirstStep, &step);
     while (step != NULL)
       {
         ll_get_data(step, LL_StepState, &step_state);
         /* We are looking for steps that are in idle state. The     */
         /* state is returned as an int so we cast it to             */
         /* enum StepState as declared in llapi.h                    */
         if ((enum StepState)step_state == STATE_IDLE)
           break;
         /* Otherwise, advance to the next step of this job          */
         ll_get_data(job, LL_JobGetNextStep, &step);
       }
     /* If we exit the loop with a valid step, it is the one to start */
     /* otherwise we need to keep looking                             */
     if (step != NULL)
       break;

     job = ll_next_obj(query_elem);
   }

 if (step == NULL)
   {
     printf("No step to start\n");
     exit(0);
   }

                 Assigning resources and dispatching jobs
|                           After an external scheduler selects a job step to start and identifies the machines
|                           that the job step will run on, the LoadLeveler job start API is used to tell
|                           LoadLeveler the job step to start and the resources that are to be assigned to the
|                           job step.

In “Example: Retrieving information about jobs” on page 121, we reached the point
where a step to start was identified. In a real external scheduler, the decision
would be reached after considering all of the idle jobs and constructing a priority
value based on attributes such as class and submit time, all of which are accessible
through ll_get_data. Next, the list of available machines would be examined to
determine whether a set exists with sufficient resources to run the job. This process
also involves determining the size of that set of machines using attributes of the
step such as the number of nodes, instances of each node, and tasks per node. The
LoadLeveler data query API provides access to that information about each job, but
the interface for starting the job does not require that the machine and adapter
resources match the specifications given when the job was submitted. For example, a
job could be submitted specifying node=4 but could be started by an external
scheduler on a single node only. Similarly, the job could specify the LAPI protocol
with network.lapi=... but be started and told to use the MPI protocol. This is not
considered an error because it is up to the scheduler to interpret (and enforce, if
necessary) the specifications in the job command file.

In allocating adapter resources for a step, it is important that the order of the
adapter usages be consistent with the structure of the step. In some environments a
task can use multiple instances of adapter windows for a protocol. If the protocol
requests striping (sn_all), an adapter window (or set of windows if instances are
used) is allocated on each available network. If multiple protocols are used by the
task (for example, MPI and LAPI), each protocol defines its own set of windows.
The array of adapter usages passed in to ll_start_job_ext must group the windows
for all of the instances on one network for the same protocol together. If the
protocol requests striping, that grouping must be immediately followed by the
grouping for the next network. If the task uses multiple protocols, the set of
adapter usages for the first protocol must be immediately followed by the set for
the next protocol. Each task will have exactly the same pattern of adapter usage
entries. Corresponding entries across all the tasks represent a communication path
and must be able to communicate with each other. If the usages are for User Space
communication, a network table will be loaded for each set of corresponding
entries.

All of the job command file keywords for specifying job structure, such as
total_tasks, tasks_per_node, node=min,max and blocking, are supported by the
ll_start_job_ext interface, but users should make sure that they understand the
LoadLeveler model that is created for each combination when constructing the
adapter usage list for ll_start_job_ext. Jobs that are submitted with node=number
and tasks_per_node produce more regular LoadLeveler models, whose adapter usage
lists are easier to create; one such job command file fragment is sketched below.
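
For instance, a job command file fragment such as the following sketch (the values
match the two-node, three-tasks-per-node layout used in the comments of the example
below) produces one of these regular models:

   # @ node = 2
   # @ tasks_per_node = 3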

In the following example, it is assumed that the step found to be dispatched will
run on one machine with two tasks, each task using one switch adapter window
for MPI communication. The name of the machine to run on is contained in the
variable use_machine (char *), the names of the switch adapters are contained in
use_adapter_1 (char *) and use_adapter_2 (char *), and the adapter windows on
those adapters in use_window_1 (int) and use_window_2 (int), respectively.
Furthermore, each adapter will be allocated 1M of memory.

If the network adapters that the external scheduler assigns to the job allocate
communication buffers in rCxt blocks instead of bytes (the Switch Network
Interface for HPS is an example of such a network adapter), the api_rcxtblocks
field of adapterUsage should be used to specify the number of rCxt blocks to
assign instead of the mem field.
 LL_start_job_info_ext *start_info;
 char * pChar;
 LL_element * step;
 LL_element * job;
 int rc;
 char * submit_host;
 char * step_id;

 start_info = (LL_start_job_info_ext *)(malloc(sizeof(LL_start_job_info_ext)));
 if (start_info == NULL)
   {
     fprintf(stderr, "Out of memory.\n");
     return;
   }

 /* Create a NULL terminated list of target machines. Each task        */
 /* must have an entry in this list and the entries for tasks on the   */
 /* same machine must be sequential. For example, if a job is to run   */
 /* on two machines, A and B, and three tasks are to run on each       */
 /* machine, the list would be: AAABBB                                 */
 /* Any specifications on the job when it was submitted such as        */
 /* nodes, total_tasks or tasks_per_node must be explicitly queried    */
 /* and honored by the external scheduler in order to take effect.     */
 /* They are not automatically enforced by LoadLeveler when an         */
 /* external scheduler is used.                                        */
 /*                                                                    */
 /* In this example, the job will be run on one machine with two       */
 /* tasks, so the machine list consists of two entries for that        */
 /* machine (plus the terminating NULL entry)                          */
 start_info->nodeList = (char **)malloc(3*sizeof(char *));
 if (!start_info->nodeList)
   {
     fprintf(stderr, "Out of memory.\n");
     return;
   }

 start_info->nodeList[0] = strdup(use_machine);
 start_info->nodeList[1] = strdup(use_machine);
 start_info->nodeList[2] = NULL;

 /* Retrieve information from the job to populate the start_info      */
 /* structure.                                                        */
 /* In the interest of brevity, the success of the ll_get_data()      */
 /* calls is not tested. In a real application it should be           */

 /* The version number is set from the header that is included when   */
 /* the application using the API is compiled. This allows for        */
 /* checking that the application was compiled with a version of the  */
 /* API that is compatible with the version in the library when the   */
 /* application is run.                                               */
 start_info->version_num = LL_PROC_VERSION;

 /* Get the first step of the job to start                            */
 ll_get_data(job, LL_JobGetFirstStep, &step);
 if (step == NULL)
   {
     printf("No step to start\n");
     return;
   }

 /* In order to set the submitting host, cluster number and proc      */
 /* number in the start_info structure, we need to parse them out of  */
 /* the step id                                                       */

 /* First get the submitting host and save it                         */
 ll_get_data(job, LL_JobSubmitHost, &submit_host);
 start_info->StepId.from_host = strdup(submit_host);
 free(submit_host);

 rc = ll_get_data(step, LL_StepID, &step_id);

 /* The step id format is submit_host.jobno.stepno . Because the      */
 /* submit host is a dotted string of indeterminate length, the       */
 /* simplest way to detect where the job number starts is to retrieve */
 /* the submit host from the job and skip forward its length in the   */
 /* step id.                                                          */

 pChar = step_id + strlen(start_info->StepId.from_host) + 1;
 /* The next segment is the cluster or job number                     */
 pChar = strtok(pChar, ".");
 start_info->StepId.cluster = atoi(pChar);
 /* The last token is the proc or step number                         */
 pChar = strtok(NULL, ".");
 start_info->StepId.proc = atoi(pChar);
 free(step_id);

 /* For each protocol (e.g., MPI or LAPI) on each task, we need to     */
 /* specify which adapter to use, and whether a window is being used   */
 /* (subsystem = "US") or not (subsystem = "IP"). If a window is used, */
 /* the window ID and window buffer size must be specified.            */
 /*                                                                    */
 /* The adapter usage entries for the protocols of a task must be      */
 /* sequential and the set of entries for tasks on the same node must  */
 /* be sequential. For example, the twelve entries for a job where     */
 /* each task uses one window for MPI and one for LAPI with three      */
 /* tasks per node and running on two nodes would be laid out as:      */
 /* 1: MPI window for 1st task running on 1st node                     */
 /* 2: LAPI window for 1st task running on 1st node                    */
 /* 3: MPI window for 2nd task running on 1st node                     */
 /* 4: LAPI window for 2nd task running on 1st node                    */
 /* 5: MPI window for 3rd task running on 1st node                     */
 /* 6: LAPI window for 3rd task running on 1st node                    */
 /* 7: MPI window for 1st task running on 2nd node                     */
 /* 8: LAPI window for 1st task running on 2nd node                    */
 /* 9: MPI window for 2nd task running on 2nd node                     */
 /* 10: LAPI window for 2nd task running on 2nd node                   */
 /* 11: MPI window for 3rd task running on 2nd node                    */
 /* 12: LAPI window for 3rd task running on 2nd node                   */
 /* An improperly ordered adapter usage list may cause the job not to  */
 /* be started or, if started, incorrect execution of the job          */
 /*                                                                    */
 /* This example starts the job with two tasks on one machine, using   */
 /* one switch adapter window for each task. The protocol is forced    */
 /* to MPI and a fixed window size of 1M is used. An actual external   */
 /* scheduler application would check the step's requirements and its  */
 /* adapter requirements with ll_get_data                              */
 /*                                                                    */
 start_info->adapterUsageCount = 2;
 start_info->adapterUsage =
   (LL_ADAPTER_USAGE *)malloc((start_info->adapterUsageCount)
                              * sizeof(LL_ADAPTER_USAGE));

 start_info->adapterUsage[0].dev_name = use_adapter_1;
 start_info->adapterUsage[0].protocol = "MPI";
 start_info->adapterUsage[0].subsystem = "US";
 start_info->adapterUsage[0].wid = use_window_1;
 start_info->adapterUsage[0].mem = 1048576;   /* 1M window buffer */

 start_info->adapterUsage[1].dev_name = use_adapter_2;
 start_info->adapterUsage[1].protocol = "MPI";
 start_info->adapterUsage[1].subsystem = "US";
 start_info->adapterUsage[1].wid = use_window_2;
 start_info->adapterUsage[1].mem = 1048576;   /* 1M window buffer */

 if ((rc = ll_start_job_ext(start_info)) != API_OK)
   {
     printf("Error %d returned attempting to start Job Step %s.%d.%d on %s\n",
            rc,
            start_info->StepId.from_host,
            start_info->StepId.cluster,
            start_info->StepId.proc,
            start_info->nodeList[0]);
   }
 else
   {
     printf("ll_start_job_ext() invoked to start job step: "
            "%s.%d.%d on machine: %s.\n\n",
            start_info->StepId.from_host, start_info->StepId.cluster,
            start_info->StepId.proc, start_info->nodeList[0]);
   }
 free(start_info->nodeList[0]);
 free(start_info->nodeList[1]);
 free(start_info->nodeList);
 free(start_info);

                        Finally, when the step and job element are no longer in use, ll_free_objs() and
                        ll_deallocate() should be called on the query element.
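
                        As noted earlier, if the adapters assigned to the job allocate communication
                        buffers in rCxt blocks rather than bytes, the same assignments would set the
                        api_rcxtblocks field of each adapter usage entry instead of the mem field. A
                        minimal sketch, assuming an illustrative count of four rCxt blocks per window:

                           start_info->adapterUsage[0].api_rcxtblocks = 4; /* instead of .mem */
                           start_info->adapterUsage[1].api_rcxtblocks = 4;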

Example: Changing scheduler types
                        You can toggle between the default LoadLeveler scheduler and other types of
                        schedulers by using the SCHEDULER_TYPE keyword.

                        Changes to SCHEDULER_TYPE will not take effect at reconfiguration. The
                        administrator must stop and restart or recycle LoadLeveler when changing
                        SCHEDULER_TYPE. A combination of changes to SCHEDULER_TYPE and some
                        other keywords may terminate LoadLeveler.

                        The following example illustrates how you can toggle between the default
                        LoadLeveler scheduler and an external scheduler, such as the Extensible Argonne
                        Scheduling sYstem (EASY), developed by Argonne National Laboratory and
                        available as public domain code.

                        If you are running the default LoadLeveler scheduler, perform the following steps
                        to switch to an external scheduler:
                        1. In the configuration file, set SCHEDULER_TYPE = API
                        2. On the central manager machine:
                            v Issue llctl -g stop and llctl -g start, or
                            v Issue llctl -g recycle
                        If you are running an external scheduler, this is how you can re-enable the
                        LoadLeveler scheduling algorithm:
                        1. In the configuration file, set SCHEDULER_TYPE = LL_DEFAULT
                        2. On the central manager machine:
                            v Issue llctl -g stop and llctl -g start, or
                            v Issue llctl -g recycle
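
                        For example, each switch amounts to changing one keyword and recycling
                        LoadLeveler; a minimal sketch (the comment lines are illustrative):

                           # Global configuration file -- hand scheduling over to an external scheduler:
                           SCHEDULER_TYPE = API

                           # ... or return control to the default LoadLeveler scheduler:
                           SCHEDULER_TYPE = LL_DEFAULT

                        After either change, issue llctl -g recycle on the central manager machine.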

Preempting and resuming jobs
                        The BACKFILL scheduler allows LoadLeveler jobs to be preempted so that a
                        higher priority job step can run.

                        Administrators may specify not only preemption rules for job classes, but also the
                        method that LoadLeveler uses to preempt jobs. The BACKFILL scheduler supports
                        various methods of preemption.




Use Table 26 to find more information about preemption.
      Table 26. Roadmap of tasks for using preemption
      Subtask                        Associated instructions (see . . . )
      Learn about types of           “Overview of preemption”
      preemption and what it
      means for preempted jobs
      Prepare the LoadLeveler        “Planning to preempt jobs” on page 128
      environment and jobs for
      preemption
      Configure LoadLeveler to use   “Steps for configuring a scheduler to preempt jobs” on page
      preemption                     130



Overview of preemption
      LoadLeveler supports two types of preemption.

       The two types of preemption that LoadLeveler supports are:
      v System-initiated preemption
        – Automatically enforced by LoadLeveler, except for job steps running under a
           reservation.
        – Governed by the PREEMPT_CLASS rules defined in the global configuration
           file.
        – When resources required by an incoming job are in use by other job steps, all
           or some of those job steps in certain classes may be preempted according to
           the PREEMPT_CLASS rules.
        – An automatically preempted job step will be resumed by LoadLeveler when
           resources become available and conditions such as START_CLASS rules are
           satisfied.
         – An automatically preempted job step cannot be resumed using the
            llpreempt command or the ll_preempt subroutine.
       v User-initiated preemption
         – Manually initiated by LoadLeveler administrators using the llpreempt
            command or the ll_preempt subroutine.
         – A manually preempted job step cannot be resumed automatically by
            LoadLeveler.
         – A manually preempted job step can be resumed using the llpreempt
            command or the ll_preempt subroutine. Issuing this command or
            subroutine, however, does not guarantee that the job step will be
            resumed successfully. A manually preempted job step that is resumed
            through these interfaces competes for resources with system-preempted
            job steps, and will be resumed only when resources become available.
         – All steps in a set of coscheduled job steps will be preempted if one
            or more steps in the set are preempted.
         – A coscheduled step will not be resumed until all steps in the set of
            coscheduled job steps can be resumed.

      For the BACKFILL scheduler only, administrators may select which method
      LoadLeveler uses to preempt and resume jobs. The suspend method is the default
      behavior, and is the preemption method LoadLeveler uses for any external
      schedulers that support preemption. For more information about preemption
      methods, see “Planning to preempt jobs” on page 128.




For a preempted job to be resumed after system- or user-initiated preemption
                        occurs through a method other than suspend, the restart keyword in the job
                        command file must be set to yes. Otherwise, LoadLeveler vacates the job step and
                        removes it from the cluster.
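
                        In job command file terms, that requirement is a single keyword; a minimal
                        sketch:

                           # @ restart = yes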

                        In order to determine the preempt type and preempt method to use when a
                        coscheduled step preempts another step, an order of precedence for preempt types
                        and preempt methods has been defined. All steps in the preempting coscheduled
                        step will be examined and the preempt type and preempt method having the
                        highest precedence will be used. The order of precedence for preempt type will be
                        ALL, ENOUGH. The precedence order for preempt method will be remove, vacate,
                        system hold, user hold, suspend.

                        When coscheduled steps are running, if one step is preempted as a result of a
                        system-initiated preemption, then all coscheduled steps will be preempted. This
                        means that more resources than necessary might be preempted when one of the
                        steps being preempted is a coscheduled step.

             Planning to preempt jobs
                        There are points to consider when planning to use preemption.

                        Consider the following points when planning to use preemption:
                        v Avoiding circular preemption under the BACKFILL scheduler
                          BACKFILL scheduling enables job preemption using rules specified with the
                          PREEMPT_CLASS keyword. When you are setting up the preemption rules,
                          make sure that you do not create a circular preemption path. Circular
                          preemption causes a job class to preempt itself after applying the preemption
                          rules recursively. For example, the following keyword definitions set up circular
                          preemption rules on Class_A:
                           PREEMPT_CLASS[Class_A] = ALL { Class_B }
                           PREEMPT_CLASS[Class_B] = ALL { Class_C }
                           PREEMPT_CLASS[Class_C] = ENOUGH { Class_A }
                           Another example of circular preemption involves allclasses:
                           PREEMPT_CLASS[Class_A] = ENOUGH {allclasses}
                           PREEMPT_CLASS[Class_B] = ALL {Class_A}

                          In this instance, allclasses means all classes except Class_A, so any
                          additional preemption rule that preempts Class_A (such as the Class_B rule
                          shown here) causes circular preemption.
                        v Understanding implied START_CLASS values
                          Using the "ALL" value in the PREEMPT_CLASS keyword places implied
                          restrictions on when a job can start. For example,
                           PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}

                           tells LoadLeveler two things:
                           1. If a new Class_A job is about to run on a node set, then preempt all Class_B
                               and Class_C jobs on those nodes
                           2. If a Class_A job is running on a node set, then do not start any Class_B or
                               Class_C jobs on those nodes
                           This PREEMPT_CLASS statement also implies the following START_CLASS
                           expressions:
                           1. START_CLASS[Class_B] = (Class_A < 1)
                           2. START_CLASS[Class_C] = (Class_A < 1)



LoadLeveler adds the implied START_CLASS expressions to any START_CLASS
  expressions specified in the configuration file; where the two conflict, the
  implied expressions take precedence over the user-specified values.
  For example, if the configuration file contains the following statements:
  PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}
  START_CLASS[Class_B] = (Class_A < 5)
  START_CLASS[Class_C] = (Class_C < 3)

  When LoadLeveler runs through the configuration process, the
  PREEMPT_CLASS statement on the first line generates the two implied
  START_CLASS statements. When those implied statements are added in, the
  user-specified START_CLASS statements are overridden and the resulting
  START_CLASS statements are effectively equivalent to:
  START_CLASS[Class_B] = (Class_A < 1)
  START_CLASS[Class_C] = (Class_C < 3) && (Class_A < 1)

  Note: LoadLeveler’s central manager (CM) uses these effective expressions
         instead of the original statements specified in the configuration file.
         The output from llclass -l, however, displays the original user-specified
         START_CLASS expressions.
v Selecting the preemption method under the BACKFILL scheduler
  Use Table 27 and Table 28 on page 130 to determine which preemption method
  you want to use for jobs running under the BACKFILL scheduler. You may define
  one or more of the following:
  – A default preemption method to be used for all job classes, by setting the
     DEFAULT_PREEMPT_METHOD keyword in the configuration file.
  – A specific preemption method for one or more classes or job steps, by using
     an option on:
     - The PREEMPT_CLASS statement in the configuration file.
     - The llpreempt command, ll_preempt subroutine or ll_preempt_jobs
       subroutine.

  Note:
          1. Process tracking must be enabled in order to use the suspend method
             to preempt a job. To configure LoadLeveler for process tracking, see
             “Tracking job processes” on page 70.
          2. For a preempted job to be resumed after system- or user-initiated
             preemption occurs through a method other than suspend and remove,
             the restart keyword in the job command file must be set to yes.
             Otherwise, LoadLeveler vacates the job step and removes it from the
             cluster.
Table 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs

Preemption method    LoadLeveler resumes the preempted job:
(abbreviation)       At this time                At this location         At this processing point
Suspend (su)         When the preempting job     On the same nodes        At the point of suspension
                     completes
Vacate (vc)          When nodes are available    Any nodes that meet      At the beginning or at the
                                                 job requirements         last successful checkpoint




Table 28. Preemption methods for which administrator or user intervention is required

For each of these methods, LoadLeveler resumes the preempted job on any nodes that
meet the job requirements, when those nodes become available, at the beginning or
at the last successful checkpoint.

Preemption method (abbreviation)   Required intervention
Remove (rm)                        Administrator or user must resubmit the preempted job
System Hold (sh)                   Administrator must release the preempted job
User Hold (uh)                     User must release the preempted job

                            v Understanding how LoadLeveler treats resources held by jobs to be
                              preempted
                              When a job step is running, it may be holding the following resources:
                              – Processors
                              – Scheduling slots
                              – Real memory
|                             – ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and
|                                 ConsumableLargePageMemory
                              – Communication switches, if the PREEMPTION_TYPE keyword is set to FULL
                                  in the configuration file.
                              When LoadLeveler suspends preemptable jobs running under the BACKFILL
                              scheduler, certain resources held by those jobs do not become available for the
|                             preempting jobs. These resources include ConsumableVirtualMemory,
|                             ConsumableLargePageMemory, and floating resources. Under the BACKFILL
                              scheduler only, LoadLeveler releases these resources when you select a
                              preemption method other than suspend. For all preemption methods other than
                              suspend, LoadLeveler treats all job-step resources as available when it preempts
                              the job step.
                            v Understanding how LoadLeveler processes multiple entries for the same
                              keywords
                              If there are multiple entries for the same keyword in either a configuration file
                              or an administration file, the last entry wins. For example, the following
                              statements are all valid specifications for the same keyword START_CLASS:
                               START_CLASS [Class_B] = (Class_A < 1)
                               START_CLASS [Class_B] = (Class_B < 1)
                               START_CLASS [Class_B] = (Class_C < 1)

                               However, all three statements identify Class_B as the incoming class.
                               LoadLeveler resolves these statements according to the "last one wins"
                               rule. Because of that, the actual value used for the keyword is
                               (Class_C < 1).

                 Steps for configuring a scheduler to preempt jobs
                            You need to know certain details about the job characteristics and workload at
                            your installation before you begin to define rules for starting and preempting jobs.

                            Before you begin:
                            v To define rules for starting and preempting jobs, you need to know certain
                              details about the job characteristics and workload at your installation, including:
                              – Which jobs require the same resources, or must be run on the same machines,
                                 and so on. This knowledge allows you to group specific jobs into a class.
                              – Which jobs or classes have higher priority than others. This knowledge allows
                                 you to define which job classes can preempt other classes.

v To correctly configure LoadLeveler to preempt jobs, you might need to refer to
                    the following information:
                    – “Choosing a scheduler” on page 44.
                    – “Planning to preempt jobs” on page 128.
                    – Chapter 12, “Configuration file reference,” on page 263.
                    – Chapter 13, “Administration file reference,” on page 321.
                    – “llctl - Control LoadLeveler daemons” on page 439.

                  Perform the following steps to configure a scheduler to preempt jobs:
                  1. In the configuration file, use the SCHEDULER_TYPE keyword to define the
                     type of LoadLeveler or external scheduler you want to use. Of the LoadLeveler
                     schedulers, only the BACKFILL scheduler supports preemption.
                     Rule: If you select the BACKFILL or API scheduler, you must set the
                     PREEMPTION_SUPPORT configuration keyword to either full or no_adapter.
                  2. (Optional) In the configuration file, use the DEFAULT_PREEMPT_METHOD
                     to define the default method that the BACKFILL scheduler should use for
                     preempting jobs.
|                    Alternative: You also may set the preemption method through the
|                    PREEMPT_CLASS keyword or on the LoadLeveler preemption command or
|                    APIs, which override the setting for the DEFAULT_PREEMPT_METHOD
|                    keyword.
                  3. For either the BACKFILL or API scheduler, preempting jobs by the suspend
                     method requires that you set the PROCESS_TRACKING configuration
                     keyword to true.
                  4. In the configuration file, use the PREEMPT_CLASS and START_CLASS
                     keywords to define the preemption and start policies for job classes.
                  5. In the administration file, use the max_total_tasks keyword to define the
                     maximum number of tasks that may be run per user, group, or class.
                  6. On the central manager machine:
                     v Issue llctl -g stop and llctl -g start, or
                     v Issue llctl -g recycle

                  When you are done with this procedure, you can use the llq command to
                  determine whether jobs are being preempted and resumed correctly. If not, use the
                  LoadLeveler logs to trace the actions of each daemon involved in preemption to
                  determine the problem.
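
                  As a concrete illustration of steps 1 through 4, the relevant entries in the
                  global configuration file might look like the following sketch (the class
                  names High and Low are hypothetical, and the method shown is one choice
                  among those listed in Table 27 and Table 28):

                     SCHEDULER_TYPE         = BACKFILL
                     PREEMPTION_SUPPORT     = full
                     # Process tracking is required only if the suspend method is used:
                     PROCESS_TRACKING       = true
                     # Preempt by the vacate method unless a rule says otherwise:
                     DEFAULT_PREEMPT_METHOD = vc
                     PREEMPT_CLASS[High]    = ALL { Low }
                     START_CLASS[Low]       = (High < 1)

                  After saving the changes, recycle LoadLeveler as described in step 6.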

    Configuring LoadLeveler to support reservations
|                 Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
|                 reservations or recurring reservations, which specify one or more time periods
|                 during which specific node resources are reserved for use by particular users or
|                 groups.

                  Normally, jobs wait to be dispatched until the resources they require become
                  available. Through the use of reservations, wait time can be reduced because only
|                 jobs that are bound to the reservation may use the node resources as soon as the
                  reservation period begins.




Reservation tasks for administrators

                            Use Table 29 to find additional information about reservations.
                            Table 29. Roadmap of reservation tasks for administrators
                            Subtask                                      Associated instructions (see . . . )
                            Learn how reservations work in the           v “Overview of reservations” on page 25
                            LoadLeveler environment
                                                                         v “Understanding the reservation life cycle”
                                                                           on page 214
                            Configuring a LoadLeveler cluster to         v “Steps for configuring reservations in a
                            support reservations                           LoadLeveler cluster”
                                                                         v “Examples: Reservation keyword
                                                                           combinations in the administration file” on
                                                                           page 134
                                                                         v “Collecting accounting data for reservations”
                                                                           on page 63
                            Working with reservations:                   “Working with reservations” on page 213
                            v Creating reservations
                            v Submitting jobs under a reservation
                            v Managing reservations
                            Correctly coding and using administration    v Chapter 13, “Administration file reference,”
                            and configuration keywords                     on page 321
                                                                         v Chapter 12, “Configuration file reference,”
                                                                           on page 263



                 Steps for configuring reservations in a LoadLeveler cluster
                            Only the BACKFILL scheduler supports the use of reservations.

                            Before you begin:
                            v For information about configuring the BACKFILL scheduler, see “Choosing a
                              scheduler” on page 44.
                            v You need to decide:
                              – Which users will be allowed to create reservations.
                              – How many reservations users may own, and how long a duration for their
                                 reservations will be allowed.
                              – Which nodes will be used for reservations.
                              – How much setup time is required before the reservation period starts.
                              – Whether accounting data for reservations is to be saved.
|                             – The maximum lifetime for a recurring reservation before you require the user
|                                to request a new reservation for that job.
|                             – Additional system-wide limitations that you may want to implement such as
|                                maintenance time blocks for specific node sets.
                            v For examples of possible reservation keyword combinations, see “Examples:
                              Reservation keyword combinations in the administration file” on page 134.
                            v For details about specific keyword syntax and use:
                              – In the administration file, see Chapter 13, “Administration file reference,” on
                                 page 321.
                              – In the configuration file, see Chapter 12, “Configuration file reference,” on
                                 page 263.

|                           Perform the following steps to configure reservations:


1. In the administration file, modify the user or group stanzas to authorize users
       to create reservations. You may grant the ability to create reservations to an
       individual user, a group of users, or a combination of users and groups. To do
       so, define the following keywords in the appropriate user or group stanzas:
       v max_reservations, to set the maximum number of reservations that a user or
          group may have.
       v (Optional) max_reservation_duration, to set the maximum amount of time
          for the reservation period.
       Tip: To quickly set up and use reservations, use one of the following examples:
       v To allow every user to create a reservation, add max_reservations=1 to the
          default user stanza. Then every administrator or user may create a
          reservation, as long as the number of reservations has not reached the limit
          for a LoadLeveler cluster.
       v To allow a specific group of users to make 10 reservations, add
          max_reservations=10 to the group stanza for that LoadLeveler group. Then
          every user in that group may create a reservation, as long as the number of
          reservations has not reached the limit for that group or for a LoadLeveler
          cluster.
       See the max_reservations description in Chapter 13, “Administration file
       reference,” on page 321 for more information about setting this keyword in the
       user or group stanza.
    2. In the administration file, modify the machine stanza of each machine that may
       be reserved. To do so, set the reservation_permitted keyword to true.
       Tip: If you want to allow every machine to be reserved, you do not have to set
       this keyword; by default, any LoadLeveler machine may be reserved. If you
       want to prevent particular machines from being reserved, however, you must
       define a machine stanza for that machine and set the reservation_permitted
       keyword to false.
    3. In the global configuration file, set reservation policy by specifying values for
       the following keywords:
       v MAX_RESERVATIONS to specify the maximum number of reservations per
          cluster.

|        Note: A recurring reservation only counts as one reservation towards the
|               MAX_RESERVATIONS limit regardless of the number of times that
|               the reservation recurs.
       v RESERVATION_CAN_BE_EXCEEDED to specify whether LoadLeveler will
         be permitted to schedule job steps bound to a reservation when their
         expected end times exceed the reservation end time.
         The default for this keyword is TRUE, which means that LoadLeveler will
         schedule these bound job steps even when they are expected to continue
         running beyond the time at which the reservation ends. Whether these job
         steps run and successfully complete depends on resource availability, which
         is not guaranteed after the reservation ends. In addition, these job steps
         become subject to preemption rules after the reservation ends.
         Tip: You might want to set this keyword value to FALSE to prevent users
         from binding long-running jobs to run under reservations of short duration.
       v RESERVATION_MIN_ADVANCE_TIME to define the minimum time
         between the time at which a reservation is created and the time at which the
         reservation is to start.
         Tip: To reduce the impact to the currently running workload, consider
         changing the default for this keyword, which allows reservations to begin as
         soon as they are created. You may, for example, require reservations to be

made at least one day (1440 minutes) in advance, by specifying
                              RESERVATION_MIN_ADVANCE_TIME=1440 in the global configuration file.
                           v RESERVATION_PRIORITY to define whether LoadLeveler administrators
                              may reserve nodes on which running jobs are expected to end after the start
                              time for the reservation.
                              Tip: The default for this keyword is NONE, which means that LoadLeveler will
                              not reserve a node on which running jobs are expected to end after the start
                              time for the reservation. If you want to allow LoadLeveler administrators to
                              reserve specific nodes regardless of the expected end times of job steps
                              currently running on the node, set this keyword value to HIGH. Note,
                              however, that setting this keyword value to HIGH might increase the number
                              of job steps that must be preempted when LoadLeveler sets up the
                              reservation, and many jobs might remain in Preempted state. This also
                              applies to Blue Gene job steps.
                              This keyword value applies only for LoadLeveler administrators; other
                              reservation owners do not have this capability.
                           v RESERVATION_SETUP_TIME to define the amount of time LoadLeveler
                              uses to prepare for a reservation before it is to start.
                        4. (Optional) In the global configuration file, set controls for the collection of
                           accounting data for reservations:
                           v To turn on accounting for reservations, add the A_RES flag to the ACCT
                              keyword.
                           v To specify a file other than the default history file to contain the data, use the
                              RESERVATION_HISTORY keyword.
                           To learn how to collect accounting data for reservations, see “Collecting
                           accounting data for reservations” on page 63.
                        5. If LoadLeveler is already started, to process the changes you made in the
                           preceding steps, issue the command llctl -g reconfig.
                           Tip: If you have changed the value of only the RESERVATION_PRIORITY
                           keyword, issue the command llctl reconfig only on the central manager node.
                           Result: The new keyword values take effect immediately, but they do not
                           change the attributes of existing reservations.

                        When you are done with this procedure, you may perform additional tasks
                        described in “Working with reservations” on page 213.
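
                        As a minimal sketch of steps 1 through 4, the two files might contain entries
                        such as the following (the user and machine names are hypothetical, and the
                        keyword values are examples only):

                           # Administration file:
                           carol: type = user
                                  max_reservations = 4
                                  max_reservation_duration = 720

                           node13: type = machine
                                   reservation_permitted = false

                           # Global configuration file:
                           MAX_RESERVATIONS = 10
                           RESERVATION_MIN_ADVANCE_TIME = 1440
                           RESERVATION_SETUP_TIME = 60
                           ACCT = A_ON A_RES

                        Then issue llctl -g reconfig as described in step 5.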

                        Examples: Reservation keyword combinations in the
                        administration file
                        The following examples demonstrate LoadLeveler behavior when the
                        max_reservations and max_reservation_duration keywords are set.

                        The examples assume that only the user and group stanzas listed exist in the
                        LoadLeveler administration file.
                        v Example 1: Assume the administration file contains the following stanzas:
                           default: type = user
                                    maxjobs = 10

                           group2: type = group
                                   include_users = rich dave steve

                           rich: type = user
                                 default_group = group2




This example shows that, by default, no one is allowed to make any
  reservations. No one, including LoadLeveler administrators, is permitted to
  make any reservations unless the max_reservations keyword is used.
v Example 2: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve

  rich: type = user
        default_group = group2
        max_reservations = 5
  This example shows how permission to make reservations can be granted to a
  specific user through the user stanza only. Because the max_reservations
  keyword is not used in any group stanza, by default, the group stanzas neither
  grant permissions nor put any restrictions on reservation permissions. User Rich
  can make reservations in any group (group2, No_Group, Group_A, and so on),
  whether or not the group stanzas exist in the LoadLeveler administration file.
  The total number of reservations user Rich can own at any given time is limited
  to five.
v Example 3: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve
          max_reservations = 5

  rich: type = user
        default_group = group2
  This example shows how permission to make reservations can be granted to a
  group of users through the group stanza only. Because the max_reservations
  keyword is not used in any user stanza, by default, the user stanzas neither
  grant nor deny permission to make reservations. All users in group2 (Rich, Dave
  and Steve) can make reservations, but they must make reservations in group2
  because other groups do not grant the permission to make reservations. The
  total number of reservations the users in group2 can own at any given time is
  limited to five.
v Example 4: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve
          max_reservations = 5

  rich: type = user
        default_group = group2
        max_reservations = 0
  This example shows how permission to make reservations can be granted to a
  group of users except one specific user. Because the max_reservations keyword
  is set to zero in the user stanza for Rich, he does not have permission to make
  any reservation, even though all other users in group2 (Dave and Steve) can
  make reservations.
v Example 5: Assume the administration file contains the following stanzas:



default: type = group
                                    max_reservations = 0

                           default: type = user
                                    max_reservations = 0

                           group2: type = group
                                   include_users = rich dave steve
                                   max_reservations = 5

                           rich: type = user
                                 default_group = group2
                                 max_reservations = 5

                           dave: type = user
                                 max_reservations = 2
                           This example shows how permission to make reservations can be granted to
                           specific user and group pairs. Because the max_reservations keyword is set to
                           zero in both the default user and group stanza, no one has permission to make
                           any reservation unless they are specifically granted permission through both the
                           user and group stanza. In this example:
   – User Rich can own, at any given time, up to five reservations in group2 only.
   – User Dave can own, at any given time, up to two reservations in group2 only.
  Together, the total number of reservations they can own at any given time is
  limited to five. No other user and group combination can make any reservations.
                        v Example 6: Assume the administration file contains the following stanzas:
                           default: type = user
                                    max_reservations = 1
                          This example permits any user to make one reservation in any group, until the
                          number of reservations reaches the maximum number allowed in the
                          LoadLeveler cluster.
                        v Example 7: Assume the administration file contains the following stanzas:
                           default: type = group
                                    max_reservations = 0

                           default: type = user
                                    max_reservations = 0

                           group1: type = group
                                   max_reservations = 6
                                   max_reservation_duration = 1440

                           carol: type = user
                                  default_group = group1
                                  max_reservations = 4
                                  max_reservation_duration = 720

                           dave: type = user
                                 default_group = group1
                                 max_reservations = 4
                                 max_reservation_duration = 2880
                            In this example, two users, Carol and Dave, are members of group1. Neither
                            Carol nor Dave belongs to any other group with a group stanza in the
                           LoadLeveler administration file, although they may use any string as the name
                           of a LoadLeveler group and belong to it by default.
                           Because the max_reservations keyword is set to zero in the default group stanza,
                           reservations can be made only in group1, which has an allotment of six
                           reservations. Each reservation can have a maximum duration of 1440 minutes
                           (24 hours).

Considering only the user-stanza attributes for reservations:
                    – User Carol can make up to four reservations with each having a maximum
                        duration of 720 minutes (12 hours).
                    – User Dave can make up to four reservations with each having a maximum
                        duration of 2880 minutes (48 hours).
                    If there are no reservations in the system and user Carol wants to make four
                    reservations, she may do so. Each reservation can have a maximum duration of
                    no more than 720 minutes. If Carol attempts to make a reservation with a
                    duration greater than 720 minutes, LoadLeveler will not make the reservation
                    because it exceeds the duration allowed for Carol.
                    Assume that Carol has created four reservations, and user Dave now wants to
                    create four reservations:
                    – The number of reservations Dave may make is limited by the state of Carol’s
                        reservations and the maximum limit on reservations for group1. If the four
                        reservations Carol made are still being set up, or are active, active shared or
                        waiting, LoadLeveler will restrict Dave to making only two reservations at
                        this time.
                    – Because the value of max_reservation_duration for the group is more
                        restrictive than max_reservation_duration for user Dave, LoadLeveler
                        enforces the group value, 1440 minutes.
                    If Dave belonged to another group that still had reservations available, then he
                    could make reservations under that group, assuming the maximum number of
                    reservations for the cluster had not been met. However, in this example, Dave
                    cannot make any further reservations because they are allowed in group1 only.

    Steps for integrating LoadLeveler with the AIX Workload Manager
|                 Another administrative setup task you must consider is whether you want to
|                 enforce resource usage of ConsumableCpus, ConsumableMemory,
|                 ConsumableVirtualMemory, and ConsumableLargePageMemory.

|                 If you want to control these resources, AIX Workload Manager (WLM) can be
|                 integrated with LoadLeveler to balance workloads at the machine level. When you
| are using WLM, workload balancing is done by assigning relative priorities to job
| processes. These job priorities prevent any one job from monopolizing a system
| resource when that resource is under contention.

|                 Note: WLM is not supported in LoadLeveler for Linux.

|                 To integrate LoadLeveler and WLM, perform the following steps:
|                 1. As required for your use, define the applicable options for ConsumableCpus,
|                     ConsumableMemory, ConsumableVirtualMemory, or
|                     ConsumableLargePageMemory as consumable resources in the
|                     SCHEDULE_BY_RESOURCES global configuration keyword. This enables the
|                     LoadLeveler scheduler to consider these consumable resources.
|                 2. As required for your use, define the applicable options for ConsumableCpus,
|                     ConsumableMemory, ConsumableVirtualMemory, or
|                     ConsumableLargePageMemory in the ENFORCE_RESOURCE_USAGE global
|                     configuration keyword. This enables enforcement of these consumable resources
|                     by AIX WLM.
|                 3. Define hard, soft or shares in the ENFORCE_RESOURCE_POLICY
|                     configuration keyword. This defines what policy is used by LoadLeveler for
|                     CPUs and real memory when setting WLM class resource entitlements.

4. (Optional) Set the ENFORCE_RESOURCE_MEMORY configuration keyword
                               to true. This setting allows AIX WLM to limit the real memory usage of a
                               WLM class as precisely as possible. When a class exceeds its limit, all processes
                               in the class are killed.
                               Rule: ConsumableMemory must be defined in the
                               ENFORCE_RESOURCE_USAGE keyword in the global configuration file, or
                               LoadLeveler does not consider the ENFORCE_RESOURCE_MEMORY
                               keyword to be valid.
                               Tips:
                               v When set to true, the ENFORCE_RESOURCE_MEMORY keyword overrides
                                  the policy set through the ENFORCE_RESOURCE_POLICY keyword for
                                  ConsumableMemory only. The ENFORCE_RESOURCE_POLICY keyword
                                  value still applies for ConsumableCpus.
                               v ENFORCE_RESOURCE_MEMORY may be set in either the global or the
                                  local configuration file. In the global configuration file, this keyword sets the
                                  default value for all the machines in the LoadLeveler cluster. If the keyword
                                  also is defined in a local file, the local setting overrides the global setting.
|                           5. Using the resources keyword in a machine stanza in the administration file,
|                              define the CPU, real memory, virtual memory, and large page machine
|                              resources available for user jobs.
                                v The ConsumableCpus reserved word accepts a count value of "all". This
                                  indicates that the initial resource count will be obtained from the Startd
                                  machine update value for CPUs.
                               v If no resources are defined for a machine, then no enforcement will be done
                                  on that machine.
                               v If the count specified by the administrator is greater than what the Startd
                                  update indicates, the initial count value will be reduced to match what the
                                  Startd reports.
|                              v For CPUs and real memory, if the count specified by the administrator is less
|                                 than what the Startd update indicates, the WLM resource shares assigned to
|                                 a job will be adjusted to represent that difference. In addition, a WLM
|                                 softlimit will be defined for each WLM class. For example, if the
|                                 administrator defines 8 CPUs on a 16 CPU machine, then a job requesting 4
|                                 CPUs will get a share of 4 and a softlimit of 50%.
                               v Use caution when determining the amount of real memory available for user
                                  jobs. A certain percentage of a machine’s real memory will be dedicated to
                                  the Default and System WLM classes and will not be included in the
                                  calculation of real memory available for users jobs. Start LoadLeveler with
                                  the ENFORCE_RESOURCE_USAGE keyword enabled and issue wlmstat -v
                                  -m. Look at the npg column to determine how much memory is being used
                                  by these classes.
|                              v ConsumableVirtualMemory and ConsumableLargePageMemory are hard
|                                 max limit values.
|                                 – AIX WLM considers the ConsumableVirtualMemory value to be real
|                                    memory plus large page plus swap space.
|                                 – The ConsumableLargePageMemory value should be a multiple of the large
|                                    page size. For example, 16MB (page size) * 4 pages = 64MB.
|                           6. Decide if all jobs should have their CPU, real memory, virtual memory, or large
|                              page resources enforced and then define the
                               ENFORCE_RESOURCE_SUBMISSION global configuration keyword.
                               v If the value specified is true, LoadLeveler will check all jobs at submission
                                  time for the resources and node_resources keywords. To be submitted, either
                                  the job’s resources or node_resources keyword must have the same
                                  resources specified as the ENFORCE_RESOURCE_USAGE keyword.


v If the value specified is false, no checking is performed. Jobs submitted
                   without the resources or node_resources keyword will not have their resources
                   enforced, and they might interfere with other jobs whose resources are enforced.
                 v To support existing job command files without the resources or
                   node_resources keyword, the default_resources and default_node_resources
                   keywords in the class stanza can be defined.

              For more information on the ENFORCE_RESOURCE_USAGE and the
              ENFORCE_RESOURCE_SUBMISSION keywords, see “Defining usage policies
              for consumable resources” on page 60.
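
               As an illustration of how these steps fit together, the following sketch shows one
               possible set of entries. The machine name, memory amount, and policy choice are
               hypothetical; see the keyword reference chapters for the authoritative syntax:

                 #  Global configuration file (illustrative values)
                 SCHEDULE_BY_RESOURCES       = ConsumableCpus ConsumableMemory
                 ENFORCE_RESOURCE_USAGE      = ConsumableCpus ConsumableMemory
                 ENFORCE_RESOURCE_POLICY     = shares
                 ENFORCE_RESOURCE_MEMORY     = true
                 ENFORCE_RESOURCE_SUBMISSION = true

                 #  Administration file (hypothetical machine stanza)
                 node01: type = machine
                         resources = ConsumableCpus(all) ConsumableMemory(6 gb)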

LoadLeveler support for checkpointing jobs
              Checkpointing is a method of periodically saving the state of a job step so that if
              the step does not complete it can be restarted from the saved state.

              When checkpointing is enabled, checkpoints can be initiated from within the
              application at major milestones, or by the user, administrator or LoadLeveler
              external to the application. Both serial and parallel job steps can be checkpointed.

              Once a job step has been successfully checkpointed, if that step terminates before
              completion, the checkpoint file can be used to resume the job step from its saved
              state rather than from the beginning. When a job step terminates and is removed
              from the LoadLeveler job queue, it can be restarted from the checkpoint file by
              submitting a new job and setting the restart_from_ckpt = yes job command file
              keyword. When a job is terminated and remains on the LoadLeveler job queue,
              such as when a job step is vacated, the job step will automatically be restarted
              from the latest valid checkpoint file. A job can be vacated as a result of flushing a
              node, issuing checkpoint and hold, stopping or recycling LoadLeveler or as the
              result of a node crash.

              To find out more about checkpointing jobs, use the information in Table 30.
              Table 30. Roadmap of tasks for checkpointing jobs
              Subtask                        Associated instructions (see . . . )
              Preparing the LoadLeveler      v “Checkpoint keyword summary”
              environment for                v “Planning considerations for checkpointing jobs” on page
              checkpointing and restarting     140
              jobs                           v “AIX checkpoint and restart limitations” on page 141
                                             v “Naming checkpoint files and directories” on page 145
              Checkpointing and restarting   v “Checkpointing a job” on page 232
              jobs                           v “Removing old checkpoint files” on page 146
              Correctly specifying           v Chapter 12, “Configuration file reference,” on page 263
              configuration and              v Chapter 13, “Administration file reference,” on page 321
              administration file keywords



        Checkpoint keyword summary
              There are keywords associated with the checkpoint and restart function.

              The following is a summary of keywords associated with the checkpoint and
              restart function.
              v Configuration file keywords

–   CKPT_CLEANUP_INTERVAL
                           –   CKPT_CLEANUP_PROGRAM
                           –   CKPT_EXECUTE_DIR
                           –   MAX_CKPT_INTERVAL
                           –   MIN_CKPT_INTERVAL
                          For more information about these keywords, see Chapter 12, “Configuration file
                          reference,” on page 263.
                        v Administration file keywords
                          – ckpt_dir
                          – ckpt_time_limit
                           For more information about these keywords, see Chapter 13, “Administration file
                           reference,” on page 321.
                        v Job command file keywords
                          – checkpoint
                          – ckpt_dir
                          – ckpt_execute_dir
                          – ckpt_file
                          – ckpt_time_limit
                          – restart_from_ckpt
                           For more information about these keywords, see “Job command file keyword
                           descriptions” on page 359.
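
                         For illustration only, a serial job command file that enables checkpointing might
                         combine these keywords as follows. The file names, directory, and limit values
                         are hypothetical; see the keyword reference for valid settings:

                            # @ job_name        = ckpt_demo
                            # @ executable      = /u/rich/bin/longrun
                            # @ checkpoint      = interval
                            # @ ckpt_dir        = /gpfs/ckpt
                            # @ ckpt_file       = ckpt_demo.ckpt
                            # @ ckpt_time_limit = 30:00,25:00
                            # @ queue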

             Planning considerations for checkpointing jobs
                        There are guidelines to review before you submit a checkpointing job.

                        Review the following guidelines before you submit a checkpointing job:
                        v Plan for jobs that you will restart on different nodes
                          If you plan to migrate jobs (restart jobs on a different node or set of nodes), you
                          should understand the difference between writing checkpoint files to a local file
                          system versus a global file system (such as AFS or GPFS™). The ckpt_file and
                          ckpt_dir keywords in the job command and configuration files allow you to
                          write to either type of file system. If you are using a local file system, before
                          restarting the job from checkpoint, make certain that the checkpoint files are
                          accessible from the machine on which the job will be restarted.
                        v Reserve adequate disk space
                          A checkpoint file requires a significant amount of disk space. The checkpoint
                          will fail if the directory where the checkpoint file is written does not have
                          adequate space. For serial jobs, one checkpoint file will be created. For parallel
                           jobs, one checkpoint file will be created for each task. Because the old set of
                           checkpoint files is not deleted until the new set of files is successfully created,
                          the checkpoint directory should be large enough to contain two sets of
                          checkpoint files. You can make an accurate size estimate only after you have run
                          your job and noticed the size of the checkpoint file that is created.
                        v Plan for staging executables
                          If you want to stage the executable for a job step, use the ckpt_execute_dir
                          keyword to define the directory where LoadLeveler will save the executable.
                          This directory cannot be the same as the current location of the executable file,
                          or LoadLeveler will not stage the executable.
                          You may define the ckpt_execute_dir keyword in either the configuration file or
                          the job command file. To decide where to define the keyword, use the
                          information in Table 31 on page 141.


           Table 31. Deciding where to define the directory for staging executables
           If the ckpt_execute_dir
           keyword is defined in:       Then the following information applies:
           The configuration file only  v LoadLeveler stages the executable file in a new subdirectory
                                          of the specified directory. The name of the subdirectory is
                                          the job step ID.
                                        v The user is the owner of the subdirectory and has permission
                                          700.
                                        v If the user issues the llckpt command with the -k option,
                                          LoadLeveler deletes the staged executable.
                                        v LoadLeveler will delete the subdirectory and the staged
                                          executable when the job step ends.
           The job command file only,   v LoadLeveler stages the executable file in the directory
           or both the configuration      specified in the job command file.
           and job command files        v The user is the owner of the file and has execute permission
                                          for it.
                                        v The user is responsible for deleting the staged file after the
                                          job step ends.
           Neither file (the keyword    LoadLeveler does not stage the executable file for the job step.
           is not defined)

          v Set your checkpoint file size to the maximum
            To make sure that your job can write a large checkpoint file, assign your job to a
            job class that has its file size limit set to the maximum (unlimited). In the
            administration file, set up a class stanza for checkpointing jobs with the
            following entry:
              file_limit = unlimited,unlimited

            This statement specifies that there is no limit on the maximum size of a file that
            your program can create.
          v Choose a unique checkpoint file name
            To prevent another job step from writing over your checkpoint file with another
            checkpoint file, make certain that your checkpoint file name is unique. The
            ckpt_dir and ckpt_file keywords give you control over the location and name of
            these files.
             For more information, see “Naming checkpoint files and directories” on page
            145.
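
           For example, a class stanza for checkpointing jobs might look like the following
           sketch, where the class name and directory are hypothetical; ckpt_dir and
           file_limit are documented class stanza keywords:

             ckpt_class: type = class
                         file_limit = unlimited,unlimited
                         ckpt_dir = /gpfs/ckpt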

    AIX checkpoint and restart limitations
          There are limitations associated with checkpoint and restart.
          v The following items cannot be checkpointed:
            – Programs that are being run under:
               - The dynamic probe class library (DPCL).
               - Any debugger.
            – MPI programs that are not compiled with mpcc_r, mpCC_r, mpxlf_r,
               mpxlf90_r, or mpxlf95_r.
            – Processes that use:
               - Extended shmat support
               - Pinned shared memory segments
|              - The debug malloc tool (MALLOCTYPE=debug)
            – Sets of processes in which any process is running a setuid program when a
               checkpoint occurs.
            – Sets of processes if any process is running a setgid program when a
               checkpoint occurs.

– Interactive parallel jobs for which POE input or output is a pipe.
                            – Interactive parallel jobs for which POE input or output is redirected, unless
                                the job is submitted from a shell that had the CHECKPOINT environment
                                variable set to yes before the shell was started. If POE is run from inside a
                                shell script and is run in the background, the script must be started from a
                                shell started in the same manner for the job to be checkpointable.
                            – Interactive POE jobs for which the su command was used prior to
                                checkpointing or restarting the job.
                        v   The node on which a process is restarted must have:
                            – The same operating system level (including PTFs). In addition, a restarted
                                process may not load a module that requires a system call from a kernel
                                extension that was not present at checkpoint time.
                            – The same switch type as the node where the checkpoint occurred.
                            If any threads in a process were bound to a specific processor ID at checkpoint
                            time, that processor ID must exist on the node where that process is restarted.
                        v   If the LoadLeveler cluster contains nodes running a mix of 32-bit and 64-bit
                            kernels then applications must be checkpointed and restarted on the same set of
                            nodes. For more information, see “llckpt - Checkpoint a running job step” on
                            page 430 and the restart_on_same_nodes keyword description.
                        v   For a parallel job, the number of tasks and the task geometry (the tasks that are
                            common within a node) must be the same on a restart as it was when the job
                            was checkpointed.
                        v   Any regular file open in a process when it is checkpointed must be present on
                            the node where that process is restarted, including the executable and any
                            dynamically loaded libraries or objects.
                        v   If any process uses sockets or pipes, user callbacks should be registered to save
                           data that may be "in flight" when a checkpoint occurs, and to restore the data
                            when the process is resumed after a checkpoint or restart. Similarly, any user
                            shared memory in a parallel task should be saved and restored.
                        v A checkpoint operation will not begin on a process until each user thread in that
                          process has released all pthread locks, if held. This can potentially cause a
                          significant delay from the time a checkpoint is issued until the checkpoint
                          actually occurs. Also, any thread of a process that is being checkpointed that
                          does not hold any pthread locks and tries to acquire one will be stopped
                          immediately. There are no similar actions performed for atomic locks
                          (_check_lock and _clear_lock, for example).
                        v Atomic locks must be used in such a way that they do not prevent the releasing
                          of pthread locks during a checkpoint. For example, if a checkpoint occurs and
                          thread 1 holds a pthread lock and is waiting for an atomic lock, and thread 2
                          tries to acquire a different pthread lock (and does not hold any other pthread
                          locks) before releasing the atomic lock that is being waited for in thread 1, the
                          checkpoint will hang.
                        v A process must not hold a pthread lock when creating a new process (either
                          implicitly using popen, for example, or explicitly using fork) if releasing the lock
                          is contingent on some action of the new process. Otherwise, a checkpoint could
                          occur which would cause the child process to be stopped before the parent
                          could release the pthread lock causing the checkpoint operation to hang.
                        v The checkpoint operation will hang if any user pthread locks are held across:
                          – Any collective communication calls in MPI or LAPI
                          – Calls to mpc_init_ckpt or mp_init_ckpt
                        v Processes cannot be profiled at the time a checkpoint is taken.
                        v There can be no devices other than TTYs or /dev/null open at the time a
                          checkpoint is taken.

v Open files must either have an absolute path name that is less than or equal to
      PATHMAX in length, or must have a relative path name that is less than or
      equal to PATHMAX in length from the current directory at the time they were
      opened. The current directory must have an absolute path name that is less than
      or equal to PATHMAX in length.
    v Semaphores or message queues that are used within the set of processes being
      checkpointed must only be used by processes within the set of processes being
      checkpointed. This condition is not verified when a set of processes is
      checkpointed. The checkpoint and restart operations will succeed, but
      inconsistent results can occur after the restart.
    v The processes that create shared memory must be checkpointed with the
      processes using the shared memory if the shared memory is ever detached from
      all processes being checkpointed. Otherwise, the shared memory may not be
      available after a restart operation.
    v The ability to checkpoint and restart a process is not supported for B1 and C2
      security configurations.
    v A process can only checkpoint another process if it can send a signal to the
      process. In other words, the privilege checking for checkpointing processes is
      identical to the privilege checking for sending a signal to the process. A
      privileged process (the effective user ID is 0) can checkpoint any process. A set
      of processes can only be checkpointed if each process in the set can be
      checkpointed.
    v A process can only restart another process if it can change its entire privilege
      state (real, saved, and effective versions of user ID, group ID, and group list) to
      match that of the restarted process. A set of processes can only be restarted if
      each process in the set can be restarted.
    v The only DCE function supported is DCE credential forwarding by LoadLeveler
      using the DCE_AUTHENTICATION_PAIR configuration keyword. DCE
      credential forwarding is for the sole purpose of DFS™ access by the application.
     v If a process invokes any Network Information Service (NIS) functions, from then
       on, AIX will delay the start of a checkpoint of that process until the process
       returns from any system call.
    v Jobs in which the message passing application is not a direct child of the
      Partition Manager Daemon (pmd) cannot be checkpointed.
|   v Scale-across jobs cannot be checkpointed.
    v The following functions will return ENOTSUP if called in a job that has enabled
      checkpointing:
      – clock_getcpuclockid()
      – clock_getres()
      – clock_gettime()
      – clock_nanosleep()
      – clock_settime()
      – mlock()
      – mlockall()
      – mq_close()
      – mq_getattr()
      – mq_notify()
      – mq_open()
      – mq_receive()
      – mq_send()
      – mq_setattr()
      – mq_timedreceive()
      – mq_timedsend()


–   mq_unlink()
                           –   munlock()
                           –   munlockall()
                           –   nanosleep()
                           –   pthread_barrier_destroy()
                           –   pthread_barrier_init()
                           –   pthread_barrier_wait()
                           –   pthread_barrierattr_destroy()
                           –   pthread_barrierattr_getpshared()
                           –   pthread_barrierattr_init()
                           –   pthread_barrierattr_setpshared()
                           –   pthread_condattr_getclock()
                           –   pthread_condattr_setclock()
                           –   pthread_getcpuclockid()
                           –   pthread_mutex_getprioceiling()
                           –   pthread_mutex_setprioceiling()
                           –   pthread_mutex_timedlock()
                           –   pthread_mutexattr_getprioceiling()
                           –   pthread_mutexattr_getprotocol()
                           –   pthread_mutexattr_setprioceiling()
                           –   pthread_mutexattr_setprotocol()
                           –   pthread_rwlock_timedrdlock()
                           –   pthread_rwlock_timedwrlock()
                           –   pthread_setschedprio()
                           –   pthread_spin_destroy()
                           –   pthread_spin_init()
                           –   pthread_spin_lock()
                           –   pthread_spin_trylock()
                           –   pthread_spin_unlock()
                           –   sched_get_priority_max()
                           –   sched_get_priority_min()
                           –   sched_getparam()
                           –   sched_getscheduler()
                           –   sched_rr_get_interval()
                           –   sched_setparam()
                           –   sched_setscheduler()
                           –   sem_close()
                           –   sem_destroy()
                           –   sem_getvalue()
                           –   sem_init()
                           –   sem_open()
                           –   sem_post()
                           –   sem_timedwait()
                           –   sem_trywait()
                           –   sem_unlink()
                           –   sem_wait()
                           –   shm_open()
                           –   shm_unlink()
                           –   timer_create()
                           –   timer_delete()
                           –   timer_getoverrun()
                           –   timer_gettime()
                           –   timer_settime()




Naming checkpoint files and directories
      At checkpoint time, a checkpoint file and potentially an error file will be created.

       For jobs that are enabled for checkpoint, a control file may be generated at the
       time of job submission. The directory that will contain these files must already
       exist and have sufficient space and permissions for these files to be written. The name
      and location of these files will be controlled through keywords in the job command
      file or the LoadLeveler configuration. The file name specified is used as a base
      name from which the actual checkpoint file name is constructed. To prevent
      another job step from writing over your checkpoint file, make certain that your
      checkpoint file name is unique. For serial jobs and the master task (POE) of
      parallel jobs, the checkpoint file name will be <basename>.Tag. For a parallel job, a
      checkpoint file is created for each task. The checkpoint file name will be
      <basename>.Taskid.Tag.

      The tag is used to differentiate between a current and previous checkpoint file. A
      control file may be created in the checkpoint directory. This control file contains
      information LoadLeveler uses for restarting certain jobs. An error file may also be
       created in the checkpoint directory. The data in this file is in a machine-readable
       format. The information contained in the error file is available in mail, in the
       LoadLeveler logs, or in the output of the checkpoint command. Both files are named with
      the same base name as the checkpoint file with the extensions .cntl and .err,
      respectively.

      Naming checkpoint files for serial and batch parallel jobs
      There is an order in which keywords are checked to construct the full path name
      for a serial or batch checkpoint file.

      The following describes the order in which keywords are checked to construct the
      full path name for a serial or batch checkpoint file:
      v Base name for the checkpoint file name
         1. The ckpt_file keyword in the job command file
          2. The default file name [<jobname>.]<job_step_id>.ckpt
            Where:
            jobname
                     The job_name specified in the Job Command File. If job_name is not
                     specified, it is omitted from the default file name
            job_step_id
                     Identifies the job step that is being checkpointed
      v Checkpoint Directory Name
          1. The ckpt_file keyword in the job command file, if it contains a "/" as the first
            character
         2. The ckpt_dir keyword in the job command file
         3. The ckpt_dir keyword specified in the class stanza of the LoadLeveler admin
            file
         4. The default directory is the initial working directory

       Note that two or more job steps running at the same time must not write to the
       same checkpoint file, because the file would be corrupted.
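
       For example (using hypothetical values), if a job command file specifies
       job_name = ocean but no ckpt_file keyword, and the job step ID is
       c209f1n01.42.0, the default base name is ocean.c209f1n01.42.0.ckpt. If the class
       stanza specifies ckpt_dir = /gpfs/ckpt, the serial checkpoint file is written in
       /gpfs/ckpt with that base name and the current tag appended.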

      Naming checkpointing files for interactive parallel jobs
      There is an order in which keywords and variables are checked to construct the
      full path name for the checkpoint file for an interactive parallel job.




The following describes the order in which keywords and variables are checked to
                        construct the full path name for the checkpoint file for an interactive parallel job.
                        v Checkpoint File Name
                          1. The value of the MP_CKPTFILE environment variable within the POE
                              process
                          2. The default file name, poe.ckpt.<pid>
                        v Checkpoint Directory Name
                          1. The value of the MP_CKPTFILE environment variable within the POE
                              process, if it contains a full path name.
                          2. The value of the MP_CKPTDIR environment variable within the POE
                              process.
                          3. The initial working directory.

                        Note: The keywords ckpt_dir and ckpt_file are not allowed in the command file
                              for an interactive session. If they are present, they will be ignored and the
                              job will be submitted.
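
                         For illustration, a user of a POSIX shell might set these environment variables
                         before starting an interactive parallel job; the directory, file name, and
                         program name are hypothetical:

                            $ export CHECKPOINT=yes
                            $ export MP_CKPTDIR=/gpfs/ckpt
                            $ export MP_CKPTFILE=poe_demo.ckpt
                            $ poe ./poe_demo -procs 4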

             Removing old checkpoint files
                        LoadLeveler provides two keywords to help automate the process of removing
                        checkpoint files that are no longer necessary.

                        To keep your system free of checkpoint files that are no longer necessary,
                        LoadLeveler provides two keywords to help automate the process of removing
                        these files:
                        v CKPT_CLEANUP_PROGRAM
                        v CKPT_CLEANUP_INTERVAL
                        Both keywords must contain valid values to automate this process. For information
                        about configuration file keyword syntax and other details, see Chapter 12,
                        “Configuration file reference,” on page 263.
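
                         For example, the global configuration file might contain entries such as the
                         following sketch; the program path and interval value are illustrative only:

                            CKPT_CLEANUP_PROGRAM  = /u/loadl/bin/rm_ckpt_files
                            CKPT_CLEANUP_INTERVAL = 3600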

LoadLeveler scheduling affinity support
                        LoadLeveler offers a number of scheduling affinity options.

                        LoadLeveler offers the following scheduling affinity options:
                        v Memory and adapter affinity
                        v Processor affinity

                         Enabling scheduling affinity allows LoadLeveler jobs to gain the performance
                         improvements of multiple chip module (MCM) affinity (memory and adapter) and
                         processor affinity. If affinity is enabled, LoadLeveler will schedule and attach the
                         appropriate CPUs in the cluster to the job tasks in order to maximize the
                         performance improvement for the type of affinity requested by the job.

                        Memory and adapter affinity

                        Memory affinity is a special purpose option for improving performance on IBM
                        POWER6™, POWER5™, and POWER4™ processor-based systems. These machines
                        contain MCMs, each containing multiple processors. System memory is attached to
                        these MCMs. While any processor can access all of the memory in the system, a
                        processor has faster access and higher bandwidth when addressing memory that is
                        attached to its own MCM rather than memory attached to the other MCMs in the
                        system. The concept of affinity also applies to the I/O subsystem. The processes
                        running on CPUs from an MCM have faster access to the adapters attached to the

I/O slots of that MCM. I/O affinity will be referred to as adapter affinity in this
          topic. For more information about memory and adapter affinity, see AIX
          Performance Management Guide.

|         Processor affinity

|         LoadLeveler provides processor affinity options to improve job performance on the
|         following platforms:
|         v IBM POWER6 and POWER5 processor-based systems running in simultaneous
|            multithreading (SMT) mode with AIX or Linux
|         v IBM POWER6 and POWER5 processor-based systems running in Single
|            Threaded (ST) mode with AIX or Linux
|         v IBM POWER4 processor-based systems with AIX or Linux
|         v x86 and x86_64 processor-based systems with Linux

|         On AIX, affinity support is implemented by using a Resource Set (RSet), which
|         contains bit maps for CPU and memory pool resources. The RSet APIs available in
|         AIX can be used to attach RSets to processes. Attaching an RSet to a process limits
|         the process to only using the resources contained in the RSet. One of the main uses
|         of RSets is to limit the application processes to run only on the processors
|         contained in a single MCM and hence to benefit from memory affinity. For more
|         details on RSets, refer to AIX System Management Guide: Operating System and
|         Devices.

|         On Linux on Power systems, affinity support is implemented by using "cpusets,"
|         which provide a mechanism for assigning a set of CPUs and memory nodes
|         (MCMs) to a set of tasks. The cpusets constrain the CPU and memory placement of
|         tasks to only the resources within a task’s current cpuset. The cpusets are managed
|         by the virtual file system type cpuset. Before configuring LoadLeveler to support
|         affinity, the cpuset virtual file system must be created on every machine in the
|         cluster to enable affinity support.

|         On Linux on x86 and x86_64 systems, affinity support is implemented by using the
|         sched_setaffinity Linux-specific system call to assign a set of physical or logical
|         CPUs to the job processes.

    Configuring LoadLeveler to use scheduling affinity
          On AIX and Linux on Power systems, scheduling affinity can be enabled by using
          the RSET_SUPPORT configuration file keyword. Machines that are configured
          with this keyword indicate the ability to service jobs requesting or requiring
          scheduling affinity.

|         Enable RSET_SUPPORT with one of these values:
|         v Choose RSET_MCM_AFFINITY to allow jobs specifying rset =
|           RSET_MCM_AFFINITY or the task_affinity keyword to run on a node. When
|           rset = RSET_MCM_AFFINITY, LoadLeveler will select and attach sets of CPUs
|           to task processes such that a set of CPUs will be from the same MCM. When the
|           task_affinity keyword is used, LoadLeveler will select CPUs regardless of their
|           location with respect to an MCM.
|         v Choose RSET_USER_DEFINED to allow jobs specifying a user-defined RSet
|           name for rset to run on a node. The RSET_USER_DEFINED option enables
|           scheduling affinity, allowing users more control over scheduling affinity
|           parameters by allowing the use of user-defined RSets. Through the use of
|           user-defined RSets, users can utilize new RSet features before a LoadLeveler

|                              implementation is released. This option also allows users to specify a different
|                              number of CPUs in their RSets depending on the needs of each task. This value
|                              is supported only on AIX machines.

                            Note:
|                                   1. Because LoadLeveler creates a cpuset for each task requesting affinity
|                                      under the /dev/cpuset directory on Linux on POWER machines, the
|                                      cpuset virtual file system must be created and mounted on the
|                                      /dev/cpuset directory by issuing the following commands on each node:
|                                      # mkdir /dev/cpuset
|                                      # mount -t cpuset none /dev/cpuset
|                                   2. A virtual file system of type cpuset mounted at /dev/cpuset will be
|                                      deleted when the node is rebooted. To create the /dev/cpuset directory
|                                      and have the virtual cpuset file system mounted on it automatically
|                                      when the node is rebooted, add the following commands to your
|                                      start-up script (for example, /etc/init.d/boot.local), which is run when the
|                                      node is rebooted or started:
|                                      if test -e /dev/cpuset || mkdir -p /dev/cpuset ; then
|                                       mount -t cpuset none /dev/cpuset
|                                      fi

|                           See “Configuration file keyword descriptions” on page 265 for more information
|                           on the RSET_SUPPORT keyword.

|                           On AIX and Linux on Power systems, jobs requesting processor affinity with the
|                           task_affinity keyword in the job command file will only run on machines where
|                           the resource statement in the machine stanza in the LoadLeveler administration file
|                           contains the ConsumableCpus keyword. For more information on specifying
|                           ConsumableCpus, see the resource keyword description in “Administration file
|                           keyword descriptions” on page 327.
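
                            A minimal sketch that ties these pieces together (the machine name is
                            hypothetical): enable affinity in the configuration file, and make CPUs
                            consumable in the machine stanza of the administration file:

                               #  Configuration file
                               RSET_SUPPORT = RSET_MCM_AFFINITY

                               #  Administration file
                               node01: type = machine
                                       resources = ConsumableCpus(all)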

|                           On Linux on x86 and x86_64 systems, exclusive allocation of CPUs to job steps is
|                           enabled by using the ALLOC_EXCLUSIVE_CPU_PER_JOB configuration file
|                           keyword. Enable ALLOC_EXCLUSIVE_CPU_PER_JOB with one of these values:
|                           v Choose the PHYSICAL option to allow LoadLeveler to assign tasks to physical
|                             processor packages. The PHYSICAL option allows LoadLeveler to treat
|                             hyperthreaded processors and multicore processors as a single unit so that a job
|                             has dedicated computing resources. For example, a node with two Intel x86
|                             processors with hyperthreading turned ON, will be treated as a node with two
|                             physical processors. Similarly, a node with two dual-core AMD Opteron
|                             processors will be treated as a node with two physical processors.
|                           v Choose the LOGICAL option to allow LoadLeveler to assign tasks to processor
|                             units. For example, a node with two Intel x86 processors with hyperthreading
|                             turned ON will be treated as a node with four processors. A node with two
|                             dual-core AMD Opteron processors will be treated as a node with four
|                             processors.

|                           See “Configuration file keyword descriptions” on page 265 for more information
|                           on the ALLOC_EXCLUSIVE_CPU_PER_JOB keyword.
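
                            For example, to dedicate physical processor packages to job tasks on such
                            machines, the configuration file might contain:

                               ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL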

    LoadLeveler multicluster support
                            To provide a more scalable runtime environment and more efficient workload
                            balancing, you may configure a LoadLeveler multicluster environment.


A LoadLeveler multicluster environment consists of two or more LoadLeveler
    clusters, grouped together through network connections that allow the clusters to
    share resources. These clusters may be AIX, Linux, or mixed clusters.

    Within a LoadLeveler multicluster environment:
    v The local cluster is the cluster from which the user submits jobs or issues
      commands.
    v A remote cluster is a cluster that accepts job submissions and commands from
      the local cluster.
    v A local gateway Schedd is a Schedd within the local cluster serving as an
      inbound point from some remote cluster, an outbound point to some remote
      cluster, or both.
    v A remote gateway Schedd is a Schedd within a remote cluster serving as an
      inbound point from the local cluster, an outbound point to the local cluster, or
      both.
    v A local central manager is the central manager in the same cluster as the local
      gateway Schedd.
    v A remote central manager is the central manager in the same cluster as a remote
      gateway Schedd.

    A LoadLeveler multicluster environment addresses scalability and workload
    balancing issues by providing the ability to:
    v Distribute workload among LoadLeveler clusters when jobs are submitted.
    v Easily access multiple LoadLeveler cluster resources.
    v Display information about the multicluster.
    v Monitor and control operations in a multicluster.
    v Transfer idle jobs from one cluster to another.
    v Transfer user input and output files between clusters.
    v Enable LoadLeveler to operate in a secure environment where clusters are
      separated by a firewall.

    Table 32 shows the multicluster support subtasks with a pointer to the associated
    instructions:
    Table 32. Multicluster support subtasks and associated instructions
    Subtask                                          Associated instructions (see . . . )
    Configure a LoadLeveler multicluster             “Configuring a LoadLeveler multicluster” on
                                                     page 150
    Submit and monitor jobs in a LoadLeveler         “Submitting and monitoring jobs in a
    multicluster                                     LoadLeveler multicluster” on page 223
|   Scale-across scheduling                          “Scale-across scheduling with multiclusters”
                                                     on page 153


    Table 33. Multicluster support related topics
    Related topics                                   Additional information (see . . . )
    Administration file: Cluster stanzas             “Defining clusters” on page 100
    Administration file: Cluster keywords            “Administration file keyword descriptions”
                                                     on page 327
    Configuration file: Cluster keywords             “Configuration file keyword descriptions”
                                                     on page 265
    Job command file: Cluster keywords               “Job command file keyword descriptions” on
                                                     page 359



                        Commands and APIs                                 Chapter 16, “Commands,” on page 411 or
                                                                          Chapter 17, “Application programming
                                                                          interfaces (APIs),” on page 541
                        Diagnosis and messages                            TWS LoadLeveler: Diagnosis and Messages
                                                                          Guide



             Configuring a LoadLeveler multicluster
                        These are the subtasks for configuring a LoadLeveler multicluster.

                        Table 34 lists the subtasks for configuring a LoadLeveler multicluster.
                        Table 34. Subtasks for configuring a LoadLeveler multicluster
                        Subtask                  Associated instructions (see . . . )
                        Configure the            v “Steps for configuring a LoadLeveler multicluster” on page 151
                        LoadLeveler              v “Steps for securing communications within a LoadLeveler
                        multicluster               multicluster” on page 153
                        environment
                        Display information      v Use the llstatus command:
                        about the LoadLeveler      – With the -X option to display information about machines
                        multicluster                 in the multicluster.
                        environment                – With the -C option to display information defined in
                                                     cluster stanzas in the administration file.
                                                 v Use the llclass command with the -X option to display
                                                   information about classes on any cluster (local or remote).
                                                 v Use the llq command with the -X option to display information
                                                   about jobs on any cluster (local or remote).




Monitor and control     Existing LoadLeveler user commands accept the -X option for a
operations in the       multicluster environment.
LoadLeveler
multicluster            Rules:
environment             v Administrator only commands are not applicable in a multicluster
                          environment.
                        v The options -x, -W, -s, and -p cannot be specified together with
                          the -X option on the llmodify command.
                        v The options -x and -w cannot be specified together with the -X
                          option on the llq command.
                        v The -X option on the following commands is restricted to a single
                          cluster:
                          – llcancel
                          – llckpt
                          – llhold
                          – llmodify
                          – llprio
                        v The following commands are not applicable in a multicluster
                          environment:
                          – llacctmrg
                          – llchres
                          – llextRPD
                          – llinit
                          – llmkres
                          – llqres
                          – llrmres
                          – llrunscheduler
                          – llsummary


Steps for configuring a LoadLeveler multicluster
The primary task for configuring a LoadLeveler multicluster environment is to
enable communication between gateway Schedd daemons on all of the clusters in
the multicluster.

To do so requires defining each Schedd daemon as either local or remote, and
defining the inbound and outbound hosts with which the daemon will
communicate.

Before you begin: You need to know that:
v A single machine may be defined as an inbound or outbound host, or as both.
v A single cluster must belong to only one multicluster.
v A single multicluster must consist of 10 or fewer clusters.
v Clusters must have unique host names within the multicluster network domain
  space.
v The inbound Schedd becomes the schedd_host of all remote jobs it receives.

Perform the following steps to configure a LoadLeveler multicluster:
1. In the administration file, define one cluster stanza for each cluster in the
   LoadLeveler multicluster environment.
   Rules:
   v You must define one cluster as the local cluster.
   v You must code the following required cluster-stanza keywords and variable
      values:

                            cluster_name: type=cluster
                                          outbound_hosts = hostname[(cluster_name)]
                                          inbound_hosts = hostname[(cluster_name)]
                           v If you want to allow users to submit remote jobs to the local cluster, the list
                              of inbound hosts must either include the name of the inbound Schedd along
                              with each cluster you are defining as remote, or specify the name of an
                              inbound Schedd without any cluster specification so that it defaults to being
                              an inbound Schedd for all clusters. (Sample stanzas follow these steps.)
                           v If the configuration file keyword SCHEDD_STREAM_PORT for any cluster
                              is set to use a port other than the default value of 9605, you must set the
                              inbound_schedd_port keyword in the cluster stanza for that cluster.
                        2. (Optional) If the local cluster is to provide job distribution, in which users
                           allow LoadLeveler to select the appropriate cluster for job submission based on
                           administrator-defined objectives, define an installation exit to be executed
                           at submit time using the CLUSTER_METRIC configuration keyword. You can
                           use the LoadLeveler data access APIs in this exit to query other clusters for
                           information about possible metrics, such as the number of jobs in a specified
                           job class, the number of jobs in the idle queue, or the number of free nodes in
                           the cluster. For more detailed information, see CLUSTER_METRIC.
                           Tip: LoadLeveler provides a set of sample exits for you to use as models. These
                           samples are in the ${RELEASEDIR}/samples/llcluster directory.
                        3. (Optional) If the local cluster wants to perform user mapping on jobs arriving
                           from remote clusters, define the CLUSTER_USER_MAPPER configuration
                           keyword. For more information, see CLUSTER_USER_MAPPER.
                        4. (Optional) If the local cluster wants to perform job filtering on jobs received
                           from remote clusters, define the CLUSTER_REMOTE_JOB_FILTER
                           configuration keyword. For more information, see
                           CLUSTER_REMOTE_JOB_FILTER.
                        5. Notify LoadLeveler daemons by issuing the llctl command with either the
                           reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
                           modifications you made to the administration file.
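
                        As an illustration of step 1, here is a minimal sketch of matching cluster
                        stanzas for a two-cluster multicluster. The cluster names and gateway host
                        names are hypothetical, and the sketch assumes that the local cluster-stanza
                        keyword marks the cluster on which the administration file resides:

                           cluster_east: type = cluster
                              local = true
                              outbound_hosts = east-gw(cluster_west)
                              inbound_hosts = east-gw(cluster_west)

                           cluster_west: type = cluster
                              outbound_hosts = west-gw(cluster_east)
                              inbound_hosts = west-gw(cluster_east)

                        On cluster_west, the same stanzas would appear with local = true coded in the
                        cluster_west stanza instead.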

                        Additional considerations:
                        v Remote jobs are subjected to the same configuration checks as locally submitted
                          jobs. Examples include account validation, class limits, include lists, and exclude
                          lists.
                        v Remote jobs will be processed by the local submit_filter prior to submission to a
                          remote cluster.
                        v Any tracker program specified in the API parameters will be invoked on the
                          nodes of the scheduling cluster.
                        v If a step is enabled for checkpoint and the ckpt_execute_dir is not specified,
                          LoadLeveler will not copy the executable to the remote cluster; the user must
                          ensure that the executable exists on the remote cluster. If the executable is not
                          in a shared file system, it can be copied to the remote cluster using the
                          cluster_input_file job command file keyword.
                        v If the job command file is also the executable and the job is submitted or moved
                          to a remote cluster, the $(executable) variable will contain the full path name of
                          the executable on the local cluster from which it came. This differs from the
                          behavior on the local cluster, where the $(executable) variable will be the
                          command line argument passed to the llsubmit command. If you only want the
                          file name, use the $(base_executable) variable.




Steps for securing communications within a LoadLeveler
          multicluster
          Configuring LoadLeveler to use the OpenSSL library enables it to operate in a
          secure environment where clusters are separated by a firewall.

          Perform the following steps to configure LoadLeveler to use OpenSSL in a
          multicluster environment:
           1. Install OpenSSL using the standard installation process for your platform.
          2. Ensure a link exists from the installed SSL library to:
             a. /usr/lib/libssl.so for 32-bit Linux platforms.
             b. /usr/lib64/libssl.so for 64-bit Linux platforms.
             c. /usr/lib/libssl.a for AIX platforms.
          3. Create the SSL authorization keys by invoking the llclusterauth command with
             the -k option on all local gateway schedds.
             Result: LoadLeveler creates a public key, a private key, and a security certificate
             for each gateway node.
          4. Distribute the public keys to remote gateway schedds on other secure clusters.
             This is done by exchanging the public keys with the other clusters you wish to
             communicate with.
              v For AIX, public keys can be found in the /var/LoadL/ssl/id_rsa.pub file.
              v For Linux, public keys can be found in the /var/opt/LoadL/ssl/id_rsa.pub
                 file.
          5. Copy the public keys of the clusters you wish to communicate with into the
             authorized_keys directory on your inbound Schedd nodes.
              v For AIX, /var/LoadL/ssl/authorized_keys
              v For Linux, /var/opt/LoadL/ssl/authorized_keys
             v The authorization key files can be named anything within the
                authorized_keys directory.
           6. Define the cluster stanzas within the LoadLeveler administration file, using the
              multicluster_security = SSL keyword. Define the ssl_cipher_list keyword if a
              specific OpenSSL cipher encryption method is desired. Use secure_schedd_port
              to define the port number to be used for secure inbound transactions to the
              cluster. (A condensed sketch of these steps follows step 8.)
          7. Notify LoadLeveler daemons by issuing the llctl -g command with the recycle
             keyword. Otherwise, LoadLeveler will not process the modifications you made
             to the administration file.
          8. Configure firewalls to accept connections to the secure_schedd_port numbers
             you defined in the administration file.
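
           The following condensed sketch pulls steps 3 through 7 together for an AIX
           gateway node. The remote host name west-gw, the cluster names, the key file
           name, and the port number are hypothetical:

              llclusterauth -k
              scp /var/LoadL/ssl/id_rsa.pub \
                  west-gw:/var/LoadL/ssl/authorized_keys/cluster_east.pub

              cluster_west: type = cluster          (administration file)
                 multicluster_security = SSL
                 secure_schedd_port = 9607

              llctl -g recycle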

|   Scale-across scheduling with multiclusters
|         In the multicluster environment, scale-across scheduling allows you to schedule
|         jobs across more than one cluster. This feature allows large jobs that request more
|         resources than a single cluster can provide to combine resources from more than
|         one cluster and run on the combined resources.

|         By effectively spanning resources across more than one cluster, scale-across
|         scheduling also allows utilization of fragmented resources from more than one
|         cluster. Fragmented resources occur when the resources available on a single
|         cluster cannot satisfy any single job on that cluster. This feature allows any size job
|         to take advantage of these resources by combining them from multiple clusters.

|                           The following are not supported with scale-across scheduling:
|                           v Checkpointing jobs
|                           v Coscheduled jobs
|                           v Data staging jobs
|                           v Hostlist jobs
|                           v IBM Blue Gene Systems resources jobs
|                           v Interactive Parallel Operating Environment (POE)
|                           v   Multistep jobs
|                           v   Preemption of scale-across jobs
|                           v   Reservations
|                           v   Secure Sockets Layer (SSL)
|                           v   Task-geometry jobs
|                           v   User space jobs

|                           Requirements for scale-across scheduling
|                           Main Cluster
|                                 In a multicluster environment that supports scale-across scheduling, one of
|                                 the clusters in the multicluster environment must be designated as the
|                                 "main cluster." The main cluster will only schedule scale-across jobs; it will
|                                 not run any jobs. Scale-across jobs will run on non-main clusters.
|                           Network Connectivity
|                                 A requirement for any cluster that will participate in scale-across
|                                 scheduling is that any node in one cluster must be able to communicate
|                                 with any other node in any other cluster that is part of the scale-across
|                                 configuration. There are two reasons for this requirement:
|                                    v Since the main cluster initiates the scale-across job, one node in the main
|                                      cluster must have connectivity to any node in any of the other clusters
|                                      where the job will run.
|                                    v Tasks of parallel applications must communicate with other tasks
|                                      running on different nodes.

|                           Configuring LoadLeveler for scale-across scheduling
|                           After you choose a set of clusters to participate in scale-across scheduling, you
|                           must designate one cluster as the main cluster. Do so by specifying a value of true
|                           in the main_scale_across_cluster keyword for that cluster’s stanza in the
|                           administration files of all scale-across clusters. The cluster that specifies this
|                           keyword as true for its own cluster stanza becomes the main cluster. Any cluster
|                           that specifies this keyword as true for another cluster stanza becomes a non-main
|                           cluster.
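|
|                           For example, the administration files of all participating clusters might contain
|                           the following sketch of cluster stanzas (cluster names hypothetical), which
|                           designates cluster_main as the main cluster. The allow_scale_across_jobs
|                           keyword shown in the other stanzas is described in Table 35:
|
|                              cluster_main: type = cluster
|                                 main_scale_across_cluster = true
|
|                              cluster_a: type = cluster
|                                 allow_scale_across_jobs = true
|
|                              cluster_b: type = cluster
|                                 allow_scale_across_jobs = true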

|                           Table 35 lists scale-across scheduling keywords:
|                           Table 35. Keywords for configuring scale-across scheduling
|                           Keyword type                    Keyword reference
|
|                           Administration file keywords       allow_scale_across_jobs cluster stanza keyword
|                                                              main_scale_across_cluster cluster stanza keyword
|                                                              allow_scale_across_jobs class stanza keyword
|
|                           Configuration file keyword         SCALE_ACROSS_SCHEDULING_TIMEOUT keyword
|



|                Tuning considerations for scale-across scheduling
|                NEGOTIATOR_CYCLE_DELAY
|                     The value on both the main and the non-main clusters should be set to
|                     similar values to minimize the wait delays on both the main and the
|                     non-main clusters that occur when the main cluster is requesting a
|                     negotiator cycle on the non-main clusters. It is reasonable to set
|                     NEGOTIATOR_CYCLE_DELAY=1 on all clusters.
|                MAX_TOP_DOGS
|                     The maximum number of top-dog scale-across jobs allowed on the main
|                     cluster should be smaller than the maximum number of top-dog jobs
|                     allowed on the non-main clusters to allow the non-main clusters to
|                     schedule both the scale-across and regular jobs as top dogs.
|                SCALE_ACROSS_SCHEDULING_TIMEOUT
|                      The default value should be overridden only if there are non-main clusters
|                      that have extremely long dispatch cycles or that have very long
|                      NEGOTIATOR_CYCLE_DELAY values. In these cases, the
|                      SCALE_ACROSS_SCHEDULING_TIMEOUT needs to be set to a value
|                      greater than those intervals.
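|
|                Following these guidelines, the relevant configuration file entries might look
|                like the following sketch; the MAX_TOP_DOGS values are illustrative only:
|
|                   NEGOTIATOR_CYCLE_DELAY = 1   # on all clusters
|                   MAX_TOP_DOGS = 5             # on the main cluster
|                   MAX_TOP_DOGS = 20            # on each non-main cluster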
|
    LoadLeveler Blue Gene support
                 Blue Gene is a massively parallel system based on a scalable cellular architecture
                 which exploits a very large number of tightly interconnected compute nodes
                 (C-nodes).

|                To take advantage of Blue Gene support, you must be using the LoadLeveler
|                BACKFILL scheduler. With the BACKFILL scheduler, LoadLeveler enables the Blue
|                Gene system to take advantage of reservations that allow you to schedule when,
|                and with which resources a job will run.

                  While LoadLeveler Blue Gene support is available on all platforms, Blue Gene®/L™
                  software is supported only on IBM POWER servers running SLES 9, and Blue
                  Gene®/P™ software is supported only on IBM POWER servers running SLES 10.
                  Mixed clusters of Blue Gene/L and Blue Gene/P systems are not supported.

                 Terms you should know:
                 v Compute nodes, also called C-nodes, are system-on-a-chip nodes that execute at
                   most a single job at a time. All the C-nodes are interconnected in a
                   three-dimensional toroidal pattern. Each C-node has a unique address and
                   location in the three-dimensional toroidal space. Compute nodes execute the
                   jobs’ tasks. Compute nodes run a minimal custom operating system called
                   BLRTS.
                 v Front End Nodes (FEN) are machines from which users and administrators
                   interact with Blue Gene. Applications are compiled on and submitted for
                   execution in the Blue Gene core from FENs. User interactions with applications,
                   including debugging, are also performed from the FENs.
                 v The Service Node is dedicated hardware that runs software to control and
                   manage the Blue Gene system.
                 v I/O nodes are special nodes that connect the compute nodes to the outside
                   world. I/O nodes allow processes that are executing in the compute nodes to
                    perform I/O operations, such as accessing files, and to communicate with the
                    job management system. Each I/O node serves anywhere from 8 to 64 C-nodes,
                            depending on the physical configuration.
                        v   mpirun is a program that is executed partly on the Front End Node, and partly
                            on the Service Node. mpirun controls and monitors the parallel Blue Gene job.
                            The mpirun program is executed by the user program that is run on the FEN by
                            LoadLeveler.
                        v   A base partition (BP) is a group of compute nodes connected in a 3D
                            rectangular pattern and their controlled I/O nodes. A base partition is one of the
                            basic allocation units for jobs. For example, an allocation for the job will require
                            at least one base partition, unless an allocation requests a small partition, in
                            which case sub base partition allocation is possible.
                        v   A small partition is a group of C-nodes which are part of one base partition.
                            Valid small partitions have a size of 32 or 128 C-nodes.
                        v   A partition is a group of base partitions, switches, and switch states allocated to
                            a job. A partition is predefined or is created on demand to execute a job.
                            Partitions are physically (electronically) isolated from each other (for example,
                            messages cannot flow outside an allocated partition). A partition can have the
                            topology of a mesh or a torus.
                        v   The Control System is a component that serves as the interface to the Blue Gene
                            system. It contains persistent storage with configuration and status information
                            on the entire system. It also provides various services to perform actions on the
                            Blue Gene system, such as launching a job.
                        v   A node card is a group of 32 compute nodes within a base partition. This is the
                            minimal allocation size for a partition.
                        v   A quarter is a group of 4 node cards. This is a logical grouping of node cards
                            within a base partition. A quarter, which is 128 compute nodes, is the next
                            smallest allowed allocation size for a partition after a node card.
                        v   A switch state is a set of internal switch connections which physically "wire" the
                            partition. A switch has a number of incoming and outgoing wires. An internal
                            switch connection physically connects one incoming wire with one outgoing
                            wire, setting up a communication path between base partitions.

                        For more information about the Blue Gene system and Blue Gene terminology,
                        refer to IBM System Blue Gene Solution documentation. Table 36 lists the IBM
                        System Blue Gene Solution publications that are available from the IBM Redbooks®
                        Web site at the following URLs:
Table 36. IBM System Blue Gene Solution documentation
Blue Gene
System           Publication Name                                 URL
Blue Gene/P      IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 System Administration                            sg247417.html
                 IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 Safety Considerations                            redp4257.html
                 IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 Application Development                          sg247287.html
                 Evolution of the IBM System Blue Gene Solution   http://www.redbooks.ibm.com/abstracts/
                                                                  redp4247.html




Blue Gene/L      IBM System Blue Gene Solution: System             http://www.redbooks.ibm.com/abstracts/
                 Administration                                    sg247178.html
                 Blue Gene/L: Hardware Overview and Planning       http://www.redbooks.ibm.com/abstracts/
                                                                   sg246796.html
                 IBM System Blue Gene Solution: Application        http://www.redbooks.ibm.com/abstracts/
                 Development                                       sg247179.html
                 Unfolding the IBM eServer™ Blue Gene Solution     http://www.redbooks.ibm.com/abstracts/
                                                                   sg246686.html


                       Table 37 lists the Blue Gene subtasks with a pointer to the associated instructions:
                       Table 37. Blue Gene subtasks and associated instructions
                       Subtask                                            Associated instructions (see . . . )
                       Configure LoadLeveler Blue Gene support            “Configuring LoadLeveler Blue Gene
                                                                          support”
                       Submit and monitor Blue Gene jobs                  “Submitting and monitoring Blue Gene jobs”
                                                                          on page 226


                       Table 38 lists the Blue Gene related topics and associated information:
                       Table 38. Blue Gene related topics and associated information
                       Related topic                                      Associated information (see . . . )
                       Configuration file: Blue Gene keywords             “Configuration file keyword descriptions”
                                                                          on page 265
                       Job command file: Blue Gene keywords               “Job command file keyword descriptions” on
                                                                          page 359
                       Commands and APIs                                  Chapter 16, “Commands,” on page 411 or
                                                                          Chapter 17, “Application programming
                                                                          interfaces (APIs),” on page 541
                       Diagnosis and messages                             TWS LoadLeveler: Diagnosis and Messages
                                                                          Guide



              Configuring LoadLeveler Blue Gene support
                       This is a list of the subtasks for configuring LoadLeveler Blue Gene support along
                       with a pointer to the associated instructions.

                       Table 39 lists the subtasks for configuring LoadLeveler Blue Gene support along
                       with a pointer to the associated instructions:
                       Table 39. Blue Gene configuring subtasks and associated instructions
                       Subtask                 Associated instructions (see . . . )
                       Configuring             “Steps for configuring LoadLeveler Blue Gene support” on page 158
                       LoadLeveler Blue
                       Gene support




                        Display information      v Use the llstatus command with the -b option to display
                        about the Blue Gene        information about the Blue Gene system, with the -B option
                        system                     to display information about Blue Gene base partitions, or
                                                   with the -P option to display information about Blue Gene
                                                   partitions.
                        Display information      v Use the llsummary command with the -l option to display job
                        about Blue Gene jobs       resource information.
                                                 v Use the llq command with the -b option to display information
                                                   about all Blue Gene jobs.


                        Steps for configuring LoadLeveler Blue Gene support
                        The primary task for configuring LoadLeveler Blue Gene support consists of
                        setting up the environment of the LoadL_negotiator daemon, the environment of
                        any process that will run Blue Gene jobs, and the LoadLeveler configuration file.

                        Perform the following steps to configure LoadLeveler Blue Gene support:
                        1. Configure the LoadL_negotiator daemon to run on a node which has access to
                           the Blue Gene Control System.
                        2. Enable Blue Gene support by setting the BG_ENABLED configuration file
                           keyword to true.
                        3. (Optional) Set any of the following additional Blue Gene related configuration
                           file keywords which your setup requires:
                           v BG_ALLOW_LL_JOBS_ONLY
                           v BG_CACHE_PARTITIONS
                           v BG_MIN_PARTITION_SIZE
                            v CM_CHECK_USERID
                           See “Configuration file keyword descriptions” on page 265 for more
                           information on these keywords.
                        4. Set the required environment variables for the LoadL_negotiator daemon and
                           any process that will run Blue Gene jobs. You can use global profiles to set the
                           necessary environment variables for all users. Follow these steps to set
                           environment variables for a LoadLeveler daemon:
                           a. Add the required environment variable settings to a global profile.
                           b. Set the environment as the administrator before invoking llctl start on the
                               central manager node.
                           c. Build a shell script that sets the required environment variables and starts
                               LoadLeveler; the script can then be invoked remotely using rsh. (A sketch
                               of such a script follows these steps.)

                            Note: Using the llctl -h or llctl -g command to start the central manager
                                  remotely will not carry the environment variables from the login session
                                  to the LoadLeveler daemons on the remote nodes.
                            v Specify the full path name of the bridge configuration file by setting the
                              BRIDGE_CONFIG_FILE environment variable. For details on the contents of
                              the bridge configuration file, see the Blue Gene/L: System Administration or
                              Blue Gene/P: System Administration book.
                              Example:
                              For ksh:
                              export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg

For csh:
               setenv BRIDGE_CONFIG_FILE /var/bluegene/config/bridge.cfg
             v Specify the full path name of the file containing the data required to access
               the Blue Gene Control System database by setting the DB_PROPERTY
               environment variable. For details on the contents of the database property
               file, see the Blue Gene/L: System Administration or Blue Gene/P: System
               Administration book.
               Example:
               For ksh:
               export DB_PROPERTY=/var/bluegene/config/db.cfg
               For csh:
               setenv DB_PROPERTY /var/bluegene/config/db.cfg
             v Specify the host name of the machine running the Blue Gene control system
               by setting the MMCS_SERVER_IP environment variable. For details on the
               use of this environment variable, see the Blue Gene/L: System Administration or
               Blue Gene/P: System Administration book.
               Example:
               For ksh:
               export MMCS_SERVER_IP=bluegene.ibm.com
               For csh:
               setenv MMCS_SERVER_IP bluegene.ibm.com
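
              Putting step 4 together, the wrapper script described in step 4c might look
              like the following sketch (ksh), which repeats the paths and host name from
              the examples above:

                 #!/bin/ksh
                 # Set the environment required by the LoadL_negotiator daemon
                 export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg
                 export DB_PROPERTY=/var/bluegene/config/db.cfg
                 export MMCS_SERVER_IP=bluegene.ibm.com
                 # Start LoadLeveler with the environment set above
                 llctl start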

    Blue Gene reservation support
|         Reservations support Blue Gene resources, including the Blue Gene compute nodes.
|         It is important to note that when the reservation includes Blue Gene nodes, it
|         cannot include conventional nodes. A front end node (FEN), which is used to start
|         a Blue Gene job, is not part of the Blue Gene resources. A Blue Gene reservation
|         only reserves Blue Gene resources and a Blue Gene job step bound to a reservation
|         uses the reserved Blue Gene resources and shares a FEN outside the reservation.

          Jobs using Blue Gene resources can be submitted to a Blue Gene reservation to run.
           A Blue Gene job step can also be used to select which Blue Gene resources to
           reserve, making sure the reservation will have enough Blue Gene resources to run
           the Blue Gene job step.

|         For more information about reservations, see “Overview of reservations” on page
|         25.

    Blue Gene fair share scheduling support
          Fair share scheduling has been extended to Blue Gene resources as well.

          The FAIR_SHARE_TOTAL_SHARES keyword in LoadL_config and the
          fair_shares keyword for the user and group stanza in LoadL_admin apply to both
          the CPU resources and the Blue Gene resources. When a Blue Gene job step ends,
          both the CPU utilization and the Blue Gene resource utilization data will be
          collected. The elapsed job running time multiplied by the number of C-nodes
          allocated to the job step (the Size Allocated field in the llq -l output) will be
          counted as the amount of Blue Gene resource used. The used shares of the Blue
          Gene resources are independent of the used shares of the CPU resources and are
          made available through the LoadLeveler variables UserUsedBgShares and
           GroupUsedBgShares. The LoadLeveler variable JobIsBlueGene indicates whether a
           job step is a Blue Gene job step. LoadLeveler administrators have flexibility
in specifying the behavior of fair share scheduling by using these variables in the
                        SYSPRIO expression. The llfs command and the related APIs can also handle
                        requests related to the Blue Gene resources.
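
           For example, a SYSPRIO expression that favors jobs whose user still has unused
           shares of the matching resource type might look like the following sketch, which
           combines the built-in JobIsBlueGene variable with the user-defined variables
           described in “Using fair share scheduling”:

              SYSPRIO : 10000000 * ($(UserHasShares) * $(JobIsNotBlueGene) + $(UserHasBgShares) * JobIsBlueGene) - QDate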

                        For more information about fair share scheduling, see “Using fair share
                        scheduling.”

             Blue Gene heterogeneous memory support
                        The LoadLeveler job command file has a bg_requirements keyword that can be
                        used to specify the requirements that a Blue Gene base partition must meet to
                        execute the job step.

                        The Blue Gene compute nodes (C-nodes) in the same base partition have the same
                        amount of physical memory. The C-nodes in different base partitions might have
                        different amounts of physical memory. The bg_requirements job command file
                        keyword allows users to specify the memory requirement on the Blue Gene
                        C-nodes.

                        The bg_requirements keyword works like the requirements keyword, but it can
                        only support memory requirements and applies only to Blue Gene base partitions.
                        For a Blue Gene job step, the requirements keyword value applies to the front end
                        node needed by the job step and the bg_requirements keyword value applies to
                        the Blue Gene base partitions needed by the job step.

             Blue Gene preemption support
                        Preemption support for Blue Gene jobs has been enabled.

                        Blue Gene jobs have the same preemption support as non-Blue Gene jobs. In a
                        typical Blue Gene system, many Blue Gene jobs share the same front end node
                        while dedicated Blue Gene resources are used for each job. To avoid preempting
                        Blue Gene jobs that use different Blue Gene resources as requested by a
                        preempting job, ENOUGH instead of ALL must be used in the PREEMPT_CLASS
                        rules for Blue Gene job preemption.

                        For more information about preemption, see “Preempting and resuming jobs” on
                        page 126.

             Blue Gene/L HTC partition support
                        The allocation of High Throughput Computing (HTC) partitions on Blue Gene/L is
                        supported when the LoadLeveler BG_CACHE_PARTITIONS configuration
                        keyword is set to false.

                        See the following IBM System Blue Gene Solution Redbooks (dated April 27, 2007)
                        for more information about Blue Gene/L HTC support:
                        v IBM Blue Gene/L: System Administration, SG24-7178
                        v IBM Blue Gene/L: Application Development, SG24-7179

Using fair share scheduling
                        Fair share scheduling in LoadLeveler provides a way to divide resources in a
                        LoadLeveler cluster among users or groups of users.

                        To fairly share cluster resources, LoadLeveler can be configured to allocate a
       proportion of the resources to each user or group and to let job priorities be
adjusted based on how much of the resources have been used and when they were
      used. Generally speaking, LoadLeveler should be configured so that job priorities
      decrease for a user or group that has recently used more resources than the
      allocated proportion and job priorities should increase for a user or group that has
      not run any jobs recently.

      Administrators can configure the behavior of fair share scheduling through a set of
      configuration keywords. They can also query fair share information, save a
      snapshot of historic data, reset and restore fair share scheduling, and perform other
      functions by using the LoadLeveler llfs command, the GUI, and the corresponding
      APIs.

      Fair share scheduling also includes Blue Gene resources (see “Blue Gene fair share
      scheduling support” on page 159 for more information).

      Note: The time of day clocks on all of the nodes in the cluster must be
            synchronized in order for fair share scheduling to work properly.

      For more information, see the following:
      v “llfs - Fair share scheduling queries and operations” on page 450
      v Corresponding APIs:
        – “ll_fair_share subroutine” on page 642
        – “Data access API” on page 560
      v Keywords:
         – fair_shares
         – FAIR_SHARE_TOTAL_SHARES
         – FAIR_SHARE_INTERVAL
      v SYSPRIO expression

Fair share scheduling keywords
      The FAIR_SHARE_TOTAL_SHARES global configuration file keyword is used to
      specify the total number of shares that each type of resource is divided into.

      The fair_shares keyword in a user or group stanza in the administration file
      specifies how many shares the user or group is allocated. The ratio of the
      fair_shares keyword value in a user or group stanza over the
      FAIR_SHARE_TOTAL_SHARES keyword value defines the resource usage
      proportion for the user or group. For example, if a user is allocated one third of
      the cluster resources, then the ratio of the user’s fair_share value over the
      FAIR_SHARE_TOTAL_SHARES keyword value should be one third.

      The LoadLeveler SYSPRIO expression can be configured to let job priorities change
      to achieve the specified resource usage proportions. Besides changing job priorities,
      fair share scheduling does not change in any way how LoadLeveler schedules jobs.
       If a job can be scheduled to run, it will be run regardless of whether the owner
       and the LoadLeveler group of the job have any shares allocated. No matter
      how many shares are allocated to a user, if the user does not submit any jobs to
      run, then the resource usage proportion for that user cannot be achieved and other
      users might be able to use more than their allocated proportions.

      Note: The sum of all allocated shares for users or groups does not have to equal
             the value of the FAIR_SHARE_TOTAL_SHARES keyword. The share
allocation can be used as a way to prevent a single user from consuming too
                               much of the cluster resources and as a way to share the resources as fairly
                               as possible.

                        When the value of the FAIR_SHARE_TOTAL_SHARES keyword is greater than 0,
                        fair share scheduling is on, which means that resource usage data is collected
                        when every job ends, regardless of the fair_shares values for any user or group.
                        The collected usage data is converted to used shares for each user and group. The
                        llfs command can be used to display the allocated and used shares. Turning fair
                        share scheduling on does not mean that job priorities are affected by fair share
                        scheduling. You have to configure the SYSPRIO expression to let fair share
                        scheduling affect job priorities in a way that suits your needs. By default, the value
                        of the FAIR_SHARE_TOTAL_SHARES keyword is 0 and fair share scheduling is
                        disabled.

                        There is a built-in decay mechanism for the historic resource usage data that is
                        collected when jobs end, that is, the initial resource usage value becomes smaller
       and smaller as time goes by. This decay mechanism allows the most recent
                        resource usage to have more impact on fair share scheduling. The
                        FAIR_SHARE_INTERVAL global configuration file keyword is used to specify
                        how fast the decay is. The shorter the interval, the faster the historic data decays.
                        A resource usage value decays to 5% of its initial value after an elapsed time
                        period of the same length as the FAIR_SHARE_INTERVAL value. Generally, the
                        interval should be at least several times larger than the typical job running time in
                        the cluster to get stable results. A value should be chosen corresponding to how
                        long the historic resource usage data should have an impact on the current job
                        priorities.
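
       For example, FAIR_SHARE_INTERVAL = 240 (hours) corresponds to roughly 10 days:
       usage recorded now decays to 5% of its initial value 240 hours later. If the
       decay is exponential, this implies that usage recorded t hours ago carries an
       approximate weight of 0.05^(t/FAIR_SHARE_INTERVAL).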

                        The LoadLeveler SYSPRIO expression is used to calculate job priorities. A set of
                        LoadLeveler variables including some related to fair share scheduling can be used
                        in the SYSPRIO expression in the global configuration file. You can define the
                        SYSPRIO expression to let fair share scheduling influence the job priorities in a
                        way that is suitable to your needs. For more information, see the SYSPRIO
                        expression in Chapter 12, “Configuration file reference,” on page 263.

                        When the GroupTotalShares, GroupUsedShares, UserTotalShares,
                        UserUsedShares, UserUsedBgShares, GroupUsedBgShares, and JobIsBlueGene
                        and their corresponding user-defined variables are used, you must use the
                        NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL global configuration
                        keyword to specify a time interval at which the job priorities will be recalculated
                        using the most recent share usage information.

                        You can add the following user-defined variables to the LoadL_config global
                        configuration file to make it easier to specify fair share scheduling in the SYSPRIO
                        expressions:
                        v GroupRemainingShares = (GroupTotalShares - GroupUsedShares)
                        v GroupHasShares = ($(GroupRemainingShares) > 0)
                        v GroupSharesExceeded = ($(GroupRemainingShares) <= 0)
                        v UserRemainingShares = (UserTotalShares - UserUsedShares)
                        v UserHasShares = ($(UserRemainingShares) > 0)
                        v UserSharesExceeded = ($(UserRemainingShares) <= 0)
                        v UserRemainingBgShares = ( UserTotalShares - UserUsedBgShares)
                        v UserHasBgShares = ( $(UserRemainingBgShares) > 0)
                        v UserBgSharesExceeded = ( $(UserRemainingBgShares) <= 0)
                        v GroupRemainingBgShares = ( GroupTotalShares - GroupUsedBgShares)
                        v GroupHasBgShares = ( $(GroupRemainingBgShares) > 0)
v GroupBgSharesExceeded = ( $(GroupRemainingBgShares) <= 0)
      v JobIsNotBlueGene = ! JobIsBlueGene

       If fair share scheduling is not turned on, either because the
       FAIR_SHARE_TOTAL_SHARES keyword value is not positive or because the scheduler
       type is not BACKFILL, then the variables will have the following values:
      GroupTotalShares: 0
      GroupUsedShares: 0
      $(GroupRemainingShares): 0
      $(GroupHasShares): 0
      $(GroupSharesExceeded): 1
      UserUsedBgShares: 0
      $(UserRemainingBgShares): 0
      $(UserHasBgShares): 0
      $(UserBgSharesExceeded): 1

      If a user has the fair_shares keyword set to 10 in its user stanza and the user has
      used up 8 CPU shares and 3 Blue Gene shares, then the variables will have the
      following values:
      UserTotalShares: 10
      UserUsedShares: 8
      $(UserRemainingShares): 2
      $(UserHasShares): 1
      $(UserSharesExceeded): 0
      UserUsedBgShares: 3
      $(UserRemainingBgShares): 7
      $(UserHasBgShares): 1
      $(UserBgSharesExceeded): 0

      If a group has the fair_shares keyword set to 10 in its group stanza and the group
      has used up 15 CPU shares and 0 Blue Gene shares, then the variables will have
      the following values:
      GroupTotalShares: 10
      GroupUsedShares: 15
      $(GroupRemainingShares): -5
      $(GroupHasShares): 0
      $(GroupSharesExceeded): 1
      GroupUsedBgShares: 0
      $(GroupRemainingBgShares): 10
      $(GroupHasBgShares): 1
      $(GroupBgSharesExceeded): 0

       The following variables have these values for a Blue Gene job step:
      JobIsBlueGene: 1
      $(JobIsNotBlueGene): 0

       The following variables have these values for a non-Blue Gene job step:
      JobIsBlueGene: 0
      $(JobIsNotBlueGene): 1

Reconfiguring fair share scheduling keywords
      LoadLeveler configuration and administration files can be modified to assign new
      values to various keywords.

      After files have been modified, issue the llctl -g reconfig command to read in the
      new keyword values. All new keywords introduced for fair share scheduling
      become effective right after reconfiguration.



Reconfiguring when the Schedd daemons are up
                        To avoid any inconsistency, change the value of the FAIR_SHARE_INTERVAL
                        keyword while the central manager and all Schedd daemons are up, then do the
                        reconfiguration.

                        After the reconfiguration, the following will happen:
                        v All historic fair share scheduling data will be decayed to the current time using
                          the old value.
                         v The old value is replaced with the new value.
                         v The new value will be used from then on.

                        Note:
                                1. You must have the same value for the FAIR_SHARE_INTERVAL
                                   keyword in the central manager and the Schedd daemons because the
                                   FAIR_SHARE_INTERVAL keyword determines the rate of decay for the
                                   historic fair share data and the same value on the daemons maintains the
                                   data consistency.
                                2. There are some LoadLeveler configuration parameters that require
                                   restarting LoadLeveler with llctl recycle for changes to take effect. You
                                   can use llctl recycle when changing fair share parameters also. The effect
                                   will be the same as using llctl reconfig because when the Schedd
                                   machine shuts down normally, the fair share scheduling data will be
                                   decayed to the time of the shutdown and it will be saved.

                        Reconfiguring when the Schedd daemons are down
                        The value for the FAIR_SHARE_INTERVAL keyword may need to be changed
                        while a Schedd daemon is down.

                        If the value for the FAIR_SHARE_INTERVAL keyword has to be changed while a
                        Schedd daemon is down, the following will happen when the Schedd daemon is
                        restarted:
                        v All historic fair share scheduling data will be read in from the disk files in the
                           $(SPOOL) directory with no change.
                        v When a new job ends, the historic fair share scheduling data for the owner and
                           the LoadLeveler group of the job will be updated using the new value and then
                           sent to the central manager. The new value is used effectively from the time the
                           data was last updated before the Schedd went down, not from the time of the
                           reconfiguration as it would normally be.

             Example: three groups share a LoadLeveler cluster
                        This example in which three groups share a LoadLeveler cluster may apply to your
                        situation.

                        For purposes of this example, we will assume the following:
                        v Three groups of users share a LoadLeveler cluster and each group is to have one
                          third of the resources
                        v Historic data will have significant impact for about 10 days
                        v Groups with unused shares will have much higher job priorities than the groups
                          which have used up their shares
                         To set up fair share scheduling with these assumptions, an administrator could
                         update the LoadL_config global configuration file as follows:



FAIR_SHARE_TOTAL_SHARES = 99

      FAIR_SHARE_INTERVAL = 240

      NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300

      GroupRemainingShares = ( GroupTotalShares - GroupUsedShares )

      GroupHasShares = ( $(GroupRemainingShares) > 0 )

      SYSPRIO : 10000000 * $(GroupHasShares) - QDate

      In the admin file LoadL_admin, add:
      chemistry: type = group

        include_users = harold mark kim enci george charlie

        fair_shares = 33

      physics: type = group

        include_users = cnyang gchen newton roy

        fair_shares = 33

      math: type = group

        include_users = rich dave chris popco
        fair_shares = 33

      When user rich in the math group wants to submit a job, the following keyword
      can be put into the job command file so that the job will have high priority
      through the math group:
      #@group=math

       If user rich has a job that does not need to run right away (it can run at any
       time), then he should run the job in a LoadLeveler group with no shares
       allocated (for example, the No_Group group). Because the group No_Group has no
       shares allocated to it in this example, $(GroupHasShares) has a value of 0 and the
       job's priority will be lower than that of jobs whose group has unused shares. The
       job will run when all higher priority jobs are done, or earlier if it can be used
       to backfill a higher priority job.

Example: two thousand students share a LoadLeveler cluster
      This example in which two thousand students share a LoadLeveler cluster may
      apply to your situation.

      For purposes of this example, we will assume the following:
      v A university has 2000 students who share a LoadLeveler cluster and every
        student is to have the same number of shares of the resources.
      v Historic data will have significant impact for about 7 days (because
        FAIR_SHARE_INTERVAL is not specified and the default value is 7 days).
       v A student with unused shares is to have somewhat higher job priorities, with
         priorities decreasing as the number of used shares increases.
      The LoadL_config global configuration file should contain the following:




FAIR_SHARE_TOTAL_SHARES = 10000

                        NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 600

                        UserRemainingShares = ( UserTotalShares - UserUsedShares )

                        SYSPRIO : 100000 * $(UserRemainingShares) - QDate

                        In the LoadL_admin admin file, add
                        default: type = user

                          fair_shares = 5

                         Note: The value fair_shares = 5 is the result of dividing the total number of
                               shares by the number of students (10000 ÷ 2000). The number of students
                               can be more or less than 2000, but the same configuration parameters still
                               prevent a single user from using too much of the cluster resources in a
                               short time period.

                        We can see from the SYSPRIO expression that the larger the number of unused
                        shares for a student and the earlier the job is submitted, the higher the priority is
                        for the student’s job.

             Querying information about fair share scheduling
                        The llfs command, the GUI, and the data access API can be used to query
                        information about fair share scheduling.

                         The llfs command without any options displays the allocated and used shares for
                         all users and LoadLeveler groups that have run one or more jobs to completion in
                         the cluster. The -u and -g options can show the allocated and used shares for any
                        user or LoadLeveler group regardless of whether they have run any jobs in the
                        cluster. In either case, the user or group need not have any fair_shares allocated in
                        the LoadL_admin administration file for the usage to be reported by the llfs
                        command.
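
                         For example (the user and group names are hypothetical):

                            llfs             Display shares for all users and groups that have run jobs
                            llfs -u rich     Display allocated and used shares for user rich
                            llfs -g math     Display allocated and used shares for group math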

             Resetting fair share scheduling
                        The llfs -r command option (or the GUI option Reset historic data), by default,
                        will start fair share scheduling from the beginning, which means that all the
                        previous historic data will be lost.

                         This command will not run unless all Schedd daemons are up and running; if a
                         Schedd daemon is down when this command option is issued, the request will not
                         be processed. To manually reset fair share scheduling, bring down the LoadLeveler
                         cluster, remove all fair share data files (fair_share_queue.dir and
                         fair_share_queue.pag) in the $(SPOOL) directory, and then restart the LoadLeveler
                         cluster.

             Saving historic data
                        The LoadLeveler central manager holds the complete historic fair share data when
                        it is up.

                        Every Schedd holds a portion of the historic fair share data and the data is stored
                        on disk in the $(SPOOL) directory. When the central manager is restarted, it
                        receives the historic fair share data from every Schedd. If a Schedd machine is
                        down temporarily and the central manager remains up, the data in the central
               manager is not affected. In case a Schedd machine is permanently damaged and
the central manager restarts, the central manager will not be able to get all of the
              historic fair share data because the data stored on the damaged Schedd is lost. If
              the value of FAIR_SHARE_INTERVAL is very large, many days of data on the
              damaged Schedd could be lost. To reduce the loss of data, the historic fair share
              data in the central manager can be saved to disk periodically. Recovery can be
              done using the latest saved data when a Schedd machine is permanently out of
              service. The llfs -s command, the GUI, or the ll_fair_share API can be used to save
              a snapshot of the historic data in the central manager to a file.

        Restoring saved historic data
              You can use the llfs -r command option, the GUI, or the ll_fair_share API to
              restore fair share scheduling to a previously saved state.

              For the file name, specify a file you saved previously using llfs -s.

              If the central manager goes down and restarts again, the historic data stored in an
              out of service Schedd machine is not reported to the central manager. If the Schedd
              machine will not be brought back to service at all, then the administrator can
              consider restoring fair share scheduling to a state corresponding to the latest saved
              file.
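
               For example, a snapshot could be saved periodically and restored later if a
               Schedd machine is lost; the file name is hypothetical:

                  llfs -s /var/loadl/fair_share.snapshot    Save a snapshot of the historic data
                  llfs -r /var/loadl/fair_share.snapshot    Restore fair share scheduling from it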

Procedure for recovering a job spool
              The llmovespool command is intended for recovery purposes only.

              Jobs being managed by a down Schedd are unable to clean up resources or move
              to completion. These jobs need their job records transferred to another Schedd. The
              llmovespool command moves the job records from the spool of one managing
              Schedd to another managing Schedd in the local cluster. All moved jobs retain
              their original job identifiers.

              It is very important that the Schedd that created the job records to be moved is not
              running during the move operation. Jobs within the job queue database will be
              unrecoverable if the job queue is updated during the move by any process other
              than the llmovespool command.

               The llmovespool command operates on a set of job records; these records are
               updated as the command executes. When a job is successfully moved, the records
              for that job are deleted. Job records that are not moved because of a recoverable
              failure, like the original Schedd not being fenced, may have the llmovespool
              command executed against them again. It is very important that a Schedd never
              reads the job records from the spool being moved. Jobs will be unrecoverable if
              more than one Schedd is considered to be the managing Schedd.

              The procedure for recovering a job spool is:
              1. Move the files located in the spool directory to be transferred to another
                 directory before entering the llmovespool command in order to guarantee that
                 no other Schedd process is updating the job records.
              2. Add the statement schedd_fenced=true to the machine stanza of the original
                 Schedd node in order to guarantee that the central manager ignores
                 connections from the original managing Schedd, and to prevent conflicts from
                 arising if the original Schedd is restarted after the llmovespool command has
                 been run. See the schedd_fenced=true keyword in Chapter 13, “Administration
                 file reference,” on page 321 for more information.
3. Reconfigure the central manager node so that it recognizes that the original
                           Schedd is “fenced”.
                        4. Issue the llmovespool command providing the spool directory where the job
                           records are stored. The command displays a message that the transfer has
                           started and reports status for each job as it is processed. For more information
                           about the llmovespool command, see “llmovespool - Move job records” on
                           page 472. For more information about the ll_move_spool API, see
                           “ll_move_spool subroutine” on page 683.
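
                         For example, the complete recovery sequence might look like the following
                         sketch, where the host names, directory paths, and stanza placement are
                         illustrative (see the llmovespool reference for the exact invocation syntax):

                            # 1. Preserve the job records so that no Schedd process can update them
                            mv /var/loadl/spool/* /tmp/spool.save/
                            # 2. In the administration file, fence the original Schedd:
                            #       schedd1: type = machine
                            #                schedd_fenced = true
                            # 3. Reconfigure so that the central manager ignores the fenced Schedd
                            llctl -h cm_node reconfig
                            # 4. Move the preserved job records to another managing Schedd
                            llmovespool /tmp/spool.save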

Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks
|                  Note: This is the last release that will provide the Motif-based graphical user
|                  interface xloadl. The function available in xloadl has been frozen since TWS
|                  LoadLeveler 3.3.2.

                    The end user can perform many tasks more efficiently using the graphical
                    user interface (GUI), but there are certain tasks that end users cannot
                    perform unless they have the proper authority.

                   If you are defined as a LoadLeveler administrator in the LoadLeveler configuration
                   file then you are immediately granted administrative authority and can perform
                   the administrative tasks discussed in this topic. To find out how to grant someone
                   administrative authority, see “Defining LoadLeveler administrators” on page 43.

                   You can access LoadLeveler administrative commands using the Admin pull-down
                   menu on both the Jobs window and the Machines window of the GUI. The Admin
                   pull-down menu on the Jobs window corresponds to the command options
                   available in the llhold, llfavoruser, and llfavorjob commands. The Admin
                   pull-down menu on the Machines window corresponds to the command options
                   available in the llctl command.
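
                    For example, the following commands perform equivalent actions from the
                    command line (the user name, job step ID, and host name are illustrative):

                       llfavoruser carol            # favor the jobs of user carol
                       llhold -s ll6.23.0           # place a system hold on a job step
                       llctl -h node01 recycle      # stop and restart the daemons on node01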

                   The main window of the GUI has three sub-windows: one for job status with
                   pull-down menus for job-related commands, one for machine status with
                   pull-down menus for machine-related commands, and one for messages and logs
                    (see “The LoadLeveler main window” on page 404 in Chapter 15, “Graphical
                    user interface (GUI) reference,” on page 403). A variety of facilities
                    allow you to sort and select the items displayed.

    Job-related administrative actions
                   You access the administrative commands that act on jobs through the Admin
                   pull-down menu in the Jobs window of the GUI.

                   You can perform the following tasks with this menu:
                   Favor Users
                      Allows you to favor users. This means that you can select one or more users
                      whose jobs you want to move up in the job queue. This corresponds to the
                      llfavoruser command.
                       Select Admin from the Jobs window
                       Select Favor User
                                  The Order by User window appears.
                       Type in
                                 The name of the user whose jobs you want to favor.
                       Press     OK
Unfavor Users
                            Allows you to unfavor users. This means that you can unfavor the jobs of
                            users that you previously favored. This corresponds to the llfavoruser
                            command.
                            Select Admin from the Jobs window
                            Select Unfavor User
                                       The Order by User window appears.
                            Type in
                                       The name of the user whose jobs you want to unfavor.
                            Press     OK
                        Favor Jobs
                           Allows you to select a job that you want to favor. This corresponds to the
                           llfavorjob command.
                            Select One or more jobs from the Jobs window
                            Select Admin from the Jobs window
                            Select Favor Job
                                       The selected jobs are favored.
                            Press     OK
                        Unfavor Jobs
                            Allows you to select a job that you want to unfavor. This corresponds to the
                           llfavorjob command.
                            Select One or more jobs from the Jobs window
                            Select Admin from the Jobs window
                            Select Unfavor Job
                                       Unfavors the jobs that you previously selected.
                        Syshold
                           Allows you to place a system hold on a job. This corresponds to the llhold
                           command.
                            Select A job from the Jobs window
                            Select Admin pull-down menu from the Jobs window
                            Select Syshold to place a system hold on the job.
                        Release From Hold
                           Allows you to release the system hold on a job. This corresponds to the llhold
                           command.
                            Select A job from the Jobs window
                            Select Admin pull-down menu from the Jobs window
                            Select Release From Hold to release the system hold on the job.
                        Preempt
                             Available only when using the BACKFILL or external schedulers. Preempt allows
                            you to place the selected jobs in preempted state. This action corresponds to
                            the llpreempt command.
                            Select One or more jobs from the Jobs window
Select Admin pull-down menu from the Jobs window
   Select Preempt
Resume Preempted Job
   Available only when using the BACKFILL or external schedulers. Resume
   Preempted Job allows you to remove user-initiated preemption (initiated using
   the Preempt menu option or the llpreempt command) from the selected jobs.
   This action corresponds to the llpreempt -r command.
   Select One or more jobs from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Resume Preempted Job
Prevent Preempt
    Available only when using the BACKFILL or API scheduler. Prevent Preempt
    allows you to place the selected running job into a non-preemptable state.
    When the BACKFILL or API scheduler is in use, this is equivalent to the
    llmodify -p nopreempt command.
   Select One job from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Prevent Preempt
Allow Preempt
    Available only when using the BACKFILL or API scheduler, Allow Preempt
    makes the unpreemptable job preemptable again. When the BACKFILL or API
    scheduler is in use, this is equivalent to the llmodify -p preempt command.
   Select One or more jobs from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Allow Preempt
Extend Wallclock Limits
    Allows you to extend the wallclock limits by the number of minutes specified.
    This corresponds to the llmodify -W command.
   Select Admin pull-down window from the Jobs window
   Select Extend Wallclock Limit
              The Extend Wallclock Limits window appears.
   Type in
             The number of minutes to extend the wallclock limit.
   Press     OK
Modify Job Priority
  Allows you to modify the system priority of a job step. This corresponds to the
  llmodify -s command.
   Select Admin pull-down window from the Jobs window
   Select Modify Job Priority
              The Modify Job Priority window appears.
   Type in
             An integer value for system priority.
   Press     OK
Move to another cluster
                           Allows you to move an idle job from the local cluster to another. This menu
                           item appears only when a multicluster environment is configured. It
                           corresponds to the llmovejob command.
                             Select Admin pull-down window from the Jobs window
                             Select Move to another cluster
                                       The Move Job to Another Cluster window appears.
                            Select The name of the target cluster.
                            Press    OK

Machine-related administrative actions
                        You access the administrative commands that act on machines using the Admin
                        pull-down menu in the Machines window of the GUI.

                        Using the GUI pull-down menu, you can perform the tasks described in this topic.
                        Start All
                            Starts LoadLeveler on all machines listed in machine stanzas beginning with
                            the central manager. Submit-only machines are skipped. Use this option when
                            specifying alternate central managers in order to ensure the primary central
                            manager starts before any alternate central manager attempts to serve as
                            central manager.
                            Select Admin from the Machines window.
                            Select Start All
                        Start LoadLeveler
                            Allows you to start LoadLeveler on selected machines.
                            Select One or more machines on which you want to start LoadLeveler.
                            Select Admin from the Machines window.
                            Select Start LoadLeveler
                        Start Drained
                            Allows you to start LoadLeveler with startd drained on selected machines.
                            Select One or more machines on which you want startd drained.
                            Select Admin from the Machines window.
                            Select Start Drained
                        Stop LoadLeveler
                           Allows you to stop LoadLeveler on selected machines.
                            Select One or more machines on which you want to stop LoadLeveler.
                            Select Admin from the Machines window.
                            Select Stop LoadLeveler.
                        Stop All
                           Stops LoadLeveler on all machines listed in machine stanzas. Submit-only
                           machines are skipped.
                            Select Admin from the Machines window.
                            Select Stop All
Reconfig
    Forces all daemons to reread the configuration files.
   Select The machine on which you want to operate. To reconfigure this xloadl
          session, choose reconfig but do not select a machine.
   Select Admin from the Machines window.
   Select reconfig
Recycle
   Stops all LoadLeveler daemons and restarts them.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select recycle
Configuration Tasks
   Starts Configuration Tasks wizard
   Select Admin from the Machines window.
   Select Config Tasks

   Note: Use the invoking script lltg to start the wizard outside of xloadl. This
   option will appear on the pull-down only if the LoadL.tguides fileset is
   installed.
Drain
    Prevents any new LoadLeveler jobs from starting on this machine, but allows
    running jobs to complete.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select drain.
           A cascading menu allows you to select either daemons, Schedd, startd,
           or startd by class. If you select daemons, both the startd and the
           Schedd on the selected machine will be drained. If you select Schedd,
           only the Schedd on the selected machine will be drained. If you select
           startd, only the startd on the selected machine will be drained. If you
           select startd by class, a window appears which allows you to select
           classes to be drained.
Flush
    Terminates running jobs on this host and sends them back to the system queue
    to await redispatch. No new jobs are redispatched to this machine until resume
    is issued. Forces a checkpoint if jobs are enabled for checkpointing.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select flush
Suspend
   Suspends all jobs on this host.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select suspend
Resume
                           Resumes all jobs on this machine.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window
                            Select resume
                                      A cascading menu allows you to select either daemons, Schedd, startd,
                                      or startd by class. If you select daemons, both the startd and the
                                      Schedd on the selected machine will be resumed. If you select Schedd,
                                      only the Schedd on the selected machine will be resumed. If you select
                                      startd, only the startd on the selected machine will be resumed. If you
                                      select startd by class, a window appears which allows you to select
                                      classes to be resumed.
                        Capture Data
                           Collects information on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Capture Data.
                        Collect Account Data
                            Collects accounting data on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Collect Account Data.
                                     A window appears prompting you to enter the name of the directory
                                     in which you want the collected data stored.
                        Collect Reservation Data
                            Collects reservation data on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Collect Reservation Data.
                                     A window appears prompting you to enter the name of the directory
                                     in which you want the collected data stored.
                        Create Account Report
                           Creates an accounting report for you.
                            Select Admin → Create Account Report...
                                     Note: If you want to receive an extended accounting report, select the
                                     extended cascading button.
                                     A window appears prompting you to enter the following information:
                                     v A short, long, or extended version of the output. The short version is
                                       the default.
                                     v The user ID
                                     v The class name
                                     v The LoadL (LoadLeveler) group name
                                     v The UNIX group name
                                     v The Allocated host
                                     v The job ID
                                     v The report Type
v The section
             v A start and end date for the report. If no date is specified, all
               of the available data is reported.
            v The name of the input data file.
             v The name of the output data file. If no file is specified, the report
               is written to standard output.
    Press   OK
            The window closes and you return to the main window. The report
            appears in the Messages window if no output data file was specified.
Move Spool
  Moves the job records from the spool of one managing Schedd to another
  managing Schedd in the local cluster. This is intended for recovery purposes
  only.
    Select One Schedd machine from the Machines window.
    Select Admin from the Machines window.
    Select Move Spool
            A window is displayed prompting you to enter the directory
            containing the job records to be moved.
    Press   OK
Version
    Displays version and release data for LoadLeveler on the machines selected in
    an information window.
    Select The machine on which you want to operate.
    Select Admin from the Machines window.
    Select version
Fair Share Scheduling
    Provides fair share scheduling functions (see “llfs - Fair share scheduling
    queries and operations” on page 450).
    Select Admin from the Machines window.
    Select Fair Share Scheduling
    A cascading menu allows you to select one of the following:
    v Show
      Displays fair share scheduling information for all users or for specified users
      and groups.
    v Save historic data
      Saves fair share scheduling information into the directory specified.
    v Restore historic data
      Restores fair share scheduling data to a state corresponding to a file
      previously saved by Save historic data or the llfs -s command.
    v Reset historic data
      Erases all historic CPU data to reset fair share scheduling.

Part 3. Submitting and managing TWS LoadLeveler jobs
            After an administrator installs IBM Tivoli Workload Scheduler (TWS) LoadLeveler
            and customizes the environment, general users can build and submit jobs to
            exploit the many features of the TWS LoadLeveler runtime environment.

Chapter 8. Building and submitting jobs
              Learn more about building and submitting jobs.

               The topics listed in Table 40 will help you learn about building and submitting jobs:
               Table 40. Learning about building and submitting jobs
               To learn about:                          Read the following:
               Creating and submitting serial and       Chapter 8, “Building and submitting jobs”
               parallel jobs
               Controlling and monitoring TWS           Chapter 9, “Managing submitted jobs,” on page
               LoadLeveler jobs                         229
               Ways to control or monitor TWS           v Chapter 16, “Commands,” on page 411
               LoadLeveler operations by using the      v Chapter 10, “Example: Using commands to
               TWS LoadLeveler commands, GUI,             build, submit, and manage jobs,” on page 235
               and APIs                                 v Chapter 11, “Using LoadLeveler’s GUI to build,
                                                          submit, and manage jobs,” on page 237
                                                        v Chapter 17, “Application programming
                                                          interfaces (APIs),” on page 541


              Table 41 lists the tasks that general users perform to run LoadLeveler jobs.
              Table 41. Roadmap of user tasks for building and submitting jobs
              To learn about:                   Read the following:
              Building jobs                     v “Building a job command file”
                                                v “Editing job command files” on page 185
                                                v “Defining resources for a job step” on page 185
                                                v “Working with coscheduled job steps” on page 187
                                                v “Using bulk data transfer” on page 188
                                                v “Preparing a job for checkpoint/restart” on page 190
                                                v “Preparing a job for preemption” on page 193
              Submitting jobs                   v “Submitting a job command file” on page 193
                                                v “llsubmit - Submit a job” on page 531
              Working with parallel jobs        “Working with parallel jobs” on page 194
              Working with reserved node        “Working with reservations” on page 213
              resources and the jobs that use
              them
              Correctly specifying job          Chapter 14, “Job command file reference,” on page 357
              command file keywords



Building a job command file
               Before you can submit a job or perform any other job-related tasks, you need to
              build a job command file.

              A job command file describes the job you want to submit, and can include
              LoadLeveler keyword statements. For example, to specify a binary to be executed,
you can use the executable keyword, which is described later in this topic. To
                        specify a shell script to be executed, the executable keyword can be used; if it is
                        not used, LoadLeveler assumes that the job command file itself is the executable.

                        The job command file can include the following:
                        v LoadLeveler keyword statements: A keyword is a word that can appear in job
                          command files. A keyword statement is a statement that begins with a
                          LoadLeveler keyword. These keywords are described in “Job command file
                          keyword descriptions” on page 359.
                        v Comment statements: You can use comments to document your job command
                          files. You can add comment lines to the file as you would in a shell script.
                        v Shell command statements: If you use a shell script as the executable, the job
                          command file can include shell commands.
                        v LoadLeveler variables: See “Job command file variables” on page 399 for more
                          information.

                        You can build a job command file either by using the Build a Job window on the
                        GUI or by using a text editor.
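
                         For example, a minimal job command file for a serial job might look like the
                         following sketch, where the file and program names are illustrative:

                            # @ job_name   = my_serial_job
                            # @ executable = /u/user/bin/myprog
                            # @ input      = myprog.in
                            # @ output     = myprog.out
                            # @ error      = myprog.err
                            # @ queue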

             Using multiple steps in a job command file
                        To specify a stream of job steps, you need to list each job step in the job command
                        file.

                        You must specify one queue statement for each job step. Also, the executables for
                        all job steps in the job command file must exist when you submit the job. For most
                        keywords, if you specify the keyword in a job step of a multi-step job, its value is
                         inherited by all subsequent job steps. Exceptions to this are noted in the keyword
                        description.

                        LoadLeveler treats all job steps as independent job steps unless you use the
                        dependency keyword. If you use the dependency keyword, LoadLeveler
                        determines whether a job step should run based upon the exit status of the
                        previously run job step.

                        For example, Figure 19 on page 181 contains two separate job steps. Notice that
                        step1 is the first job step to run and that step2 is a job step that runs only if step1
                        exits with the correct exit status.

#   This job command file lists two job steps called "step1"
     #   and "step2". "step2" only runs if "step1" completes
     #   with exit status = 0. Each job step requires a new
     #   queue statement.
     #
     #   @   step_name = step1
     #   @   executable = executable1
     #   @   input = step1.in1
     #   @   output = step1.out1
      #   @   error = step1.err1
     #   @   queue
     #   @   dependency = (step1 == 0)
     #   @   step_name = step2
     #   @   executable = executable2
     #   @   input = step2.in1
     #   @   output = step2.out1
     #   @   error = step2.err1
     #   @   queue

     Figure 19. Job command file with multiple steps

     In Figure 19, step1 is called the sustaining job step. step2 is called the dependent job
     step because whether or not it begins to run is dependent upon the exit status of
      step1. A single sustaining job step can have more than one dependent job step,
     and a dependent job step can also have job steps dependent upon it.

     In Figure 19, each job step has its own executable, input, output, and error
     statements. Your job steps can have their own separate statements, or they can use
     those statements defined in a previous job step. For example, in Figure 20, step2
     uses the executable statement defined in step1:

     #   This job command file uses only one executable for
     #   both job steps.
     #
     #   @   step_name = step1
     #   @   executable = executable1
     #   @   input = step1.in1
     #   @   output = step1.out1
     #   @   error = step1.err1
     #   @   queue
     #   @   dependency = (step1 == 0)
     #   @   step_name = step2
     #   @   input = step2.in1
     #   @   output = step2.out1
     #   @   error = step2.err1
     #   @   queue

     Figure 20. Job command file with multiple steps and one executable

Examples: Job command files
     These examples of job command files may apply to your situation.
     v Example 1: Generating multiple jobs with varying outputs
       To run a program several times, varying the initial conditions each time, you
        could create multiple LoadLeveler scripts, each specifying a different input and
       output file as described in Figure 22 on page 183. It would probably be more
       convenient to prepare different input files and submit the job only once, letting
       LoadLeveler generate the output files and do the multiple submissions for you.
       Figure 21 on page 182 illustrates the following:
       – You can refer to the LoadLeveler name of your job symbolically, using
          $(jobid) and $(stepid) in the LoadLeveler script file.
       – $(jobid) refers to the job identifier.
– $(stepid) refers to the job step identifier and increases after each queue
                              command. Therefore, you only need to specify input, output, and error
                              statements once to have LoadLeveler name these files correctly.
                            Assume that you created five input files and each input file has different initial
                            conditions for the program. The names of the input files are in the form
                            longjob.in.x, where x is 0–4.
                            Submitting the LoadLeveler script shown in Figure 21 results in your program
                            running five times, each time with a different input file. LoadLeveler generates
                            the output file from the LoadLeveler job step IDs. This ensures that the results
                            from the different submissions are not merged.

                        #   @   executable = longjob
                        #   @   input = longjob.in.$(stepid)
                        #   @   output = longjob.out.$(jobid).$(stepid)
                        #   @   error = longjob.err.$(jobid).$(stepid)
                        #   @   queue
                        #   @   queue
                        #   @   queue
                        #   @   queue
                        #   @   queue

                        Figure 21. Job command file with varying input statements

                            To submit the job, type the command:
                            llsubmit longjob.cmd

                            LoadLeveler responds by issuing the following:
                            submit: The job "ll6.23" with 5 job steps has been submitted.

                            Table 42 lists the standard input files, standard output files, and standard error
                            files for the five job steps:
                        Table 42. Standard files for the five job steps
                        Job Step                 Standard Input           Standard Output    Standard Error
                        ll6.23.0                 longjob.in.0             longjob.out.23.0   longjob.err.23.0
                        ll6.23.1                 longjob.in.1             longjob.out.23.1   longjob.err.23.1
                        ll6.23.2                 longjob.in.2             longjob.out.23.2   longjob.err.23.2
                        ll6.23.3                 longjob.in.3             longjob.out.23.3   longjob.err.23.3
                        ll6.23.4                 longjob.in.4             longjob.out.23.4   longjob.err.23.4

                        v Example 2: Using LoadLeveler variables in a job command file
                          Figure 22 on page 183 shows how you can use LoadLeveler variables in a job
                          command file to assign different names to input and output files. This example
                          assumes the following:
                          – The name of the machine from which the job is submitted is lltest1
                          – The user’s home directory is /u/rhclark and the current working directory is
                             /u/rhclark/OSL
                          – LoadLeveler assigns a value of 122 to $(jobid).
                          In Job Step 0:
                          – LoadLeveler creates the subdirectories oslsslv_out and oslsslv_err if they do
                             not exist at the time the job step is started.
                          In Job Step 1:
– The character string ~rhclark denotes the home directory of user rhclark in
   input, output, error, and executable statements.
    – The $(base_executable) variable is set to be the “base” portion of the
       executable, which is oslsslv.
    – The $(host) variable is equivalent to $(hostname). Similarly, $(jobid) and
       $(stepid) are equivalent to $(cluster) and $(process), respectively.
    In Job Step 2:
    – This job step is executed only if the return codes from Step 0 and Step 1 are
      both equal to zero.
    – The initial working directory for Step 2 is explicitly specified.

#   Job step 0 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.0.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_0_err
#
#   @   job_name = OSL
#   @   step_name = step_0
#   @   executable = oslsslv
#   @   arguments = -maxmin=min -scale=yes -alg=dual
#   @   environment = OSL_ENV1=20000; OSL_ENV2=500000
#   @   requirements = (Arch == "R6000") && (OpSys == "AIX53")
#   @   input = test01.mps.$(stepid)
#   @   output = $(executable)_out/$(host).$(jobid).$(stepid).out
#   @   error = $(executable)_err/$(host)_$(jobid)_$(stepid)_err
#   @   queue
#
#   Job step 1 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.1.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_1_err
#
#   @   step_name = step_1
#   @   executable = ~rhclark/$(job_name)/oslsslv
#   @   arguments = -maxmin=max -scale=no -alg=primal
#   @   environment = OSL_ENV1=60000; OSL_ENV2=500000; \
#                     OSL_ENV3=70000; OSL_ENV4=800000;
#   @   input = ~rhclark/$(job_name)/test01.mps.$(stepid)
#   @   output = ~rhclark/$(job_name)/$(base_executable)_out/$(hostname).$(cluster).$(process).out
#   @   error = ~rhclark/$(job_name)/$(base_executable)_err/$(hostname)_$(cluster)_$(process)_err
#   @   queue
#
#   Job step 2 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.2.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_2_err
#
#   @   step_name = OSL
#   @   dependency = (step_0 == 0) && (step_1 == 0)
#   @   comment = oslsslv
#   @   initialdir = /u/rhclark/$(step_name)
#   @   arguments = -maxmin=min -scale=yes -alg=dual
#   @   environment = OSL_ENV1=300000; OSL_ENV2=500000
#   @   input = test01.mps.$(stepid)
#   @   output = $(comment)_out/$(host).$(jobid).$(stepid).out
#   @   error = $(comment)_err/$(host)_$(jobid)_$(stepid)_err
#   @   queue

Figure 22. Using LoadLeveler variables in a job command file

v Example 3: Using the job command file as the executable
  The name of the sample script shown in Figure 23 on page 185 is run_spice_job.
  This script illustrates the following:
  – The script does not contain the executable keyword. When you do not use
    this keyword, LoadLeveler assumes that the script is the executable. (Since the
name of the script is run_spice_job, you can add the executable =
                             run_spice_job statement to the script, but it is not necessary.)
                           – The job consists of four job steps (there are 4 queue statements). The spice3f5
                             and spice2g6 programs are invoked at each job step using different input data
                             files:
                             - spice3f5: Input for this program is from the file spice3f5_input_x where x
                                has a value of 0, 1, and 2 for job steps 0, 1, and 2, respectively. The name of
                                this file is passed as the first argument to the script. Standard output and
                                standard error data generated by spice3f5 are directed to the file
                                 spice3f5_output_x. The name of this file is passed as the second argument to
                                the script. In job step 3, the names of the input and output files are
                                spice3f5_input_benchmark1 and spice3f5_output_benchmark1,
                                respectively.
                             - spice2g6: Input for this program is from the file spice2g6_input_x.
                                Standard output and standard error data generated by spice2g6 together
                                with all other standard output and standard error data generated by this
                                script are directed to the files spice_test_output_x and spice_test_error_x,
                                respectively. In job step 3, the name of the input file is
                                spice2g6_input_benchmark1. The standard output and standard error files
                                are spice_test_output_benchmark1 and spice_test_error_benchmark1.
                             All file names that are not fully qualified are relative to the initial working
                             directory /home/loadl/spice. LoadLeveler will send the job steps 0 and 1 of
                              this job to a machine that has a real memory of 64 MB or more for
                              execution. Job step 2 most likely will be sent to a machine that has more than
                             128 MB of real memory and has the ESSL library installed since these
                             preferences have been stated using the LoadLeveler preferences keyword.
                             LoadLeveler will send job step 3 to the machine ll5.pok.ibm.com for
                             execution because of the explicit requirement for this machine in the
                             requirements statement.

#!/bin/ksh
               # @ job_name = spice_test
               # @ account_no = 99999
               # @ class = small
               # @ arguments = spice3f5_input_$(stepid) spice3f5_output_$(stepid)
               # @ input = spice2g6_input_$(stepid)
               # @ output = $(job_name)_output_$(stepid)
               # @ error = $(job_name)_error_$(stepid)
               # @ initialdir = /home/loadl/spice
                # @ requirements = ((Arch == "R6000") && \
               #           (OpSys == "AIX53") && (Memory > 64))
               # @ queue
               # @ queue
               # @ preferences = ((Memory > 128) && (Feature == "ESSL"))
               # @ queue
               # @ class = large
               # @ arguments = spice3f5_input_benchmark1 spice3f5_output_benchmark1
               # @ requirements = (Machine == "ll5.pok.ibm.com")
               # @ input = spice2g6_input_benchmark1
               # @ output = $(job_name)_output_benchmark1
               # @ error = $(job_name)_error_benchmark1
               # @ queue
                OS_NAME=`uname`

               case $OS_NAME in
                  AIX)
                     echo "Running $OS_NAME version of spice3f5" > $2
                     AIX_bin/spice3f5 < $1 >> $2 2>&1
                     echo "Running $OS_NAME version of spice2g6"
                     AIX_bin/spice2g6
                     ;;
                  *)
                     echo "spice3f5 for $OS_NAME is not available" > $2
                     echo "spice2g6 for $OS_NAME is not available"
                     ;;
               esac

               Figure 23. Job command file used as the executable

Editing job command files
               After you build a job command file, you can edit it using the editor of your choice.

               You may want to change the name of the executable or add or delete some
               statements.

               When you create a job command file, it is considered the job executable unless you
               specify otherwise by using the executable keyword in the job command file.
               LoadLeveler copies the executable to the spool directory unless the checkpoint
               keyword was set to yes or interval. Jobs that are to be checkpointed cannot be
               moved to the spool directory. Do not make any changes to the executable while the
                job is still in the queue; doing so could affect the way that job runs.

Defining resources for a job step
               The LoadLeveler user may use the resources keyword in the job command file to
               specify the resources to be consumed by each task of a job step.

               If the resources keyword is specified in the job command file, it overrides any
               default_resources specified by the administrator for the job step’s class.

For example, the following job requests one CPU and one FRM license for each of
                            its tasks:
                            resources = ConsumableCpus(1) FRMlicense(1)

                            If this were specified in a serial job step, one CPU and one FRM license would be
                            consumed while the job step runs. If this were a parallel job step, then the number
                            of CPUs and FRM licenses consumed while the job step runs would depend upon
                            how many tasks were running on each machine. For more information on
                            assigning tasks to nodes, see “Task-assignment considerations” on page 196.

                            Alternatively, you can use the node_resources keyword in the job command file to
                            specify the resources to be consumed by the job step on each machine it runs on,
                            regardless of the number of tasks assigned to each machine. If the node_resources
                            keyword is specified in the job command file, it overrides the
                            default_node_resources specified by the administrator for the job step’s class.

                            For example, the following job requests 240 MB of ConsumableMemory on each
                            machine:
                            node_resources = ConsumableMemory(240 mb)

                            Even if one machine only runs one task of the job step, while other machines run
                            multiple tasks, 240 MB will be consumed on every machine.
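
                             For example, a job step might combine both keywords to consume one CPU and
                             one license per task, plus 240 MB of memory on each machine (FRMlicense is
                             the administrator-defined resource used in the example above):

                                # @ resources      = ConsumableCpus(1) FRMlicense(1)
                                # @ node_resources = ConsumableMemory(240 mb)
                                # @ queue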

|   Submitting jobs requesting data staging
|                           The dstg_in_script keyword causes LoadLeveler to generate an inbound data
|                           staging step, without requiring the #@queue specification. The value assigned to
|                           this keyword is the executable to be started for data staging, together with
|                           any arguments needed by that script or executable.

|                           The dstg_in_wall_clock_limit keyword specifies a wall clock time for the inbound
|                           data staging step. Specifying the estimated wall clock limit is mandatory when a
|                           data staging script is specified. Similarly, dstg_out_script and
|                           dstg_out_wall_clock_limit will be used for generation and execution of the
|                           outbound data staging step for the job. All data staging job steps are assigned to
|                           the predefined class called data_stage.

|                           Resources required for data staging can be specified using the dstg_resources
|                           keyword.

|                           The dstg_node keyword allows you to specify how data replicas must be created:
|                           v If the value specified is any, one data staging task is executed on any available
|                             node in the cluster with data staging resources. This value can be used with
|                             either the at_submit or the just_in_time configuration options.
|                           v If the value specified is master, one data staging task is executed on the master
|                             node. The master node is the machine that will be used to run the inbound and
|                             outbound data staging steps as well as the first application step of the job.
|                           v If the value is all, a data staging task is executed on each of the nodes that will
|                             be or were used by the first application step.

|                           Any environment variables needed by the data staging scripts can be specified
|                           using the dstg_environment keyword. The copy_all value can be assigned to this
|                           keyword to get all of the user’s environment variables.
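
                            For example, a job requesting data staging might look like the following
                            sketch, where the script names, paths, and wall clock limits are
                            illustrative (see the keyword reference for the exact formats):

                               # @ job_name                  = dstg_example
                               # @ dstg_in_script            = /u/user/bin/stage_in.sh /gpfs/input.tar
                               # @ dstg_in_wall_clock_limit  = 00:10:00
                               # @ dstg_out_script           = /u/user/bin/stage_out.sh /gpfs/results.tar
                               # @ dstg_out_wall_clock_limit = 00:10:00
                               # @ dstg_node                 = master
                               # @ executable                = /u/user/bin/compute
                               # @ queue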

|                 For detailed information about the data staging job command file keywords, see
|                 “Job command file keyword descriptions” on page 359.

    Working with coscheduled job steps
                  LoadLeveler allows you to specify that a group of two or more steps within a job
                  are to be coscheduled. Coscheduled steps are dispatched at the same time.

           Submitting coscheduled job steps
|                 The coschedule = yes keyword in the job command file is used to specify which
|                 steps within a job are to be coscheduled.

|                 All steps within a job with the coschedule keyword set to yes will be coscheduled.
|                 The coscheduled steps will continue to be stored as individual steps in both
|                 memory and in the job queue, but when performing certain operations, such as
|                 scheduling, the steps will be managed as a single entity. An operation initiated on
|                 one of the coscheduled steps will cause the operation to be performed on all other
|                 steps (unless the coscheduling dependency between steps is broken).
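
                   For example, the following sketch coschedules two steps so that they are
                   dispatched together (the step and executable names are illustrative):

                      # @ step_name  = server_step
                      # @ executable = /u/user/bin/server
                      # @ coschedule = yes
                      # @ queue
                      # @ step_name  = client_step
                      # @ executable = /u/user/bin/client
                      # @ coschedule = yes
                      # @ queue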

           Determining priority for coscheduled job steps
                  Coscheduled steps are supported only with the BACKFILL scheduler. The
                  LoadLeveler BACKFILL scheduler will only dispatch the set of coscheduled steps
                  when enough resource is available for all steps in the set to start.

                  If the set of coscheduled steps cannot be started immediately, but enough resource
                  will be available in the future, then the resource for all the steps will be reserved.
                  In this case, only one of the coscheduled steps will be designated as a top dog, but
                  enough resources will be reserved for all coscheduled steps and all the steps will
                  be dispatched when the top dog step is started. The coscheduled step with the
                  highest priority in the current job queue will be designated as the primary
                  coscheduled step and all other steps will be secondary coscheduled steps. The
                  primary coscheduled step will determine when the set of coscheduled steps will be
                  scheduled. The priority for all other coscheduled steps is ignored.

           Supporting preemption of coscheduled job steps
                  Preemption of coscheduled steps is supported.

                  Preemption of coscheduled steps is supported with the following restrictions:
                  v In order for a step S to be preemptable by a coscheduled step, all steps in the set
                    of coscheduled steps must be able to preempt step S.
                  v In order for a step S to preempt a coscheduled step, all steps in the set of
                    coscheduled steps must be preemptable by step S.
                  v The set of job steps available for preemption will be the same for all coscheduled
                    steps. Any resource made available by preemption for one coscheduled step will
                    be available to all other coscheduled steps.

                  To determine the preempt type and preempt method to use when a coscheduled
                  step preempts another step, an order of precedence for preempt types and preempt
                  methods has been defined. All steps in the preempting coscheduled step are
                  examined and the preempt type and preempt method having the highest
                  precedence are used. The order of precedence for preempt type will be ALL and
                  ENOUGH. The precedence order for preempt method is:
                  v Remove
v   Vacate
                        v   System Hold
                         v   User Hold
                        v   Suspend

                        For more information about preempt types and methods, see “Planning to preempt
                        jobs” on page 128.

                        When coscheduled steps are running, if one step is preempted as a result of a
                        system-initiated preemption, then all coscheduled steps are preempted. When
                        determining an optimal preempt set, the BACKFILL scheduler does not consider
                        coscheduled steps as a single entity. All coscheduled steps are in the initial
                        preempt set, but the final preempt set might not include all coscheduled steps, if
                        the scheduler determines the resources of some coscheduled steps are not
                        necessary to start the preempting job step. This implies that more resource than
                        necessary might be preempted when a coscheduled step is in the set of steps to be
                         preempted: regardless of whether all coscheduled steps are in the
                        preempt set, if one coscheduled step is preempted, then all coscheduled steps will
                        be preempted.

             Coscheduled job steps and commands and APIs
                        Commands and APIs that operate on job steps are impacted by coscheduled steps.

                        For the llbind, llcancel, llhold, and llpreempt commands, even if all coscheduled
                        steps are not in the list of targeted steps, the requested operation is performed on
                        all coscheduled steps.

                        For the llmkres and llchres commands, a coscheduled job step cannot be specified
                        when using the -j or -f flags. For the llckpt command, you cannot specify a
                        coscheduled job step using the -u flag.

             Termination of coscheduled steps
                        If a coscheduled step is dispatched but cannot be started and is rejected by the
                        startd daemon or the starter process, then all coscheduled steps are rejected.

                        If a running step is removed or vacated by LoadLeveler as a result of a system
                        related failure, then all coscheduled steps are removed or vacated. If a running
                        step is vacated as a result of the VACATE expression evaluating to true for the
                        step, then all coscheduled steps are vacated.

Using bulk data transfer
                        On systems with device drivers and network adapters that support remote
                        direct-memory access (RDMA), LoadLeveler supports bulk data transfer for jobs
                        that use either the Internet or user space communication protocol mode.

                        For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or
                        control the use of bulk transfer. For user space jobs that request bulk transfer,
                        however, LoadLeveler creates a consumable RDMA resource requirement.
                        Machines with Switch Network Interface for HPS network adapters are
                        automatically given an RDMA consumable resource with an available amount of
                        four. Machines with InfiniBand switch adapters are given unlimited RDMA
                        consumable resources. Each step that requests bulk transfer consumes one RDMA
                        resource on each machine on which that step runs.

The RDMA resource is similar to user-defined consumable resources except in one
    important way: A user-specified resource requirement is consumed by every task
    of the job assigned to a machine, whereas the RDMA resource is consumed once
    on a machine no matter how many tasks of the job are running on the machine.
    Other than that exception, LoadLeveler handles the RDMA resource as it does all
    other consumable resources. LoadLeveler displays RDMA resources in the output
    of the following commands:
    v llq -l
    v llsummary -l

    LoadLeveler also displays RDMA resources in the output of the following
    commands for machines with Switch Network Interface for HPS network adapters:
    v llstatus -l
    v llstatus -R

    Bulk transfer is supported only on systems where the device driver of the network
    adapters supports RDMA. To determine which systems will support bulk transfer,
    use the llstatus command with the -l, -R, or -a flag to display machines with
    adapters that support RDMA. Machines with Switch Network Interface for HPS
    network adapters will have an RDMA resource listed in the command output of
|   llstatus -l and llstatus -R. The llstatus -a command displays the adapters list,
    which can be used to verify whether InfiniBand adapters are connected to the
    machines.

    Under certain conditions, LoadLeveler displays a total count of RDMA resources as
    less than four for machines with Switch Network Interface for HPS network
    adapters:
    v If jobs that LoadLeveler does not manage use RDMA, the amount of available
       RDMA resource reported to the Negotiator is reduced by the amount consumed
       by the unmanaged jobs.
    v In rare situations, LoadLeveler jobs can fail to release their adapter resources
       before reporting to the Negotiator that they have completed. When this occurs,
       the amount of available RDMA reported to the Negotiator is reduced by the
       amount consumed by the unreleased adapter resources. When the adapter
       resources are eventually released, the RDMA resource they consumed becomes
       available again.
    These conditions do not require corrective action.

    You do not need to perform specific job-definition tasks to enable bulk transfer for
    LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect
    whether IP communication uses bulk transfer; the implementation of IP where the
    job runs determines whether bulk transfer is supported.

    To enable user space jobs to use bulk data transfer, however, all of the following
    tasks must be completed. If you omit one or more of these steps, the job will run
    but will not be able to use bulk transfer.
    v A LoadLeveler administrator must update the LoadLeveler configuration file to
       include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines
       with Switch Network Interfaces for HPS network adapters. It is not required to
       include RDMA in the SCHEDULE_BY_RESOURCES list for machines with
       InfiniBand network adapters.
       Example:

      SCHEDULE_BY_RESOURCES = RDMA others


v Users must request bulk transfer for their LoadLeveler jobs, using one of the
                          following methods:
                          – Specifying the bulkxfer keyword in the LoadLeveler job command file.
                             Example:

                              #@ bulkxfer=yes
                             If users specify this keyword for jobs that use the IP communication protocol,
                             LoadLeveler ignores the bulkxfer keyword.
– Specifying a POE command line parameter on interactive jobs.
                             Example:
                              poe_job -use_bulk_xfer=yes
                           – Specifying an environment variable on interactive jobs.
                             Example:
   export MP_USE_BULK_XFER=yes
   poe_job
                        v Because LoadLeveler honors the bulk transfer request only for LAPI or MPI jobs,
                          users must ensure that the network keyword in the job command file specifies
                          the MPI, LAPI, or MPI_LAPI protocol for user space communication.
                          Examples:
   network.MPI = sn_single,not_shared,US,HIGH
   network.MPI_LAPI = sn_single,not_shared,US,HIGH
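
Taken together, a minimal sketch of a job command file for a user space job that
requests bulk transfer might look like the following. The output, error,
arguments, and class values here are hypothetical; the network and bulkxfer
statements are the ones described above:

   # @ job_type = parallel
   # @ output = bulk.out
   # @ error = bulk.err
   # @ node = 2
   # @ tasks_per_node = 2
   # @ network.MPI = sn_single,not_shared,US,HIGH
   # @ bulkxfer = yes
   # @ executable = /usr/bin/poe
   # @ arguments = /u/user/my_us_program
   # @ class = POE
   # @ queue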


Preparing a job for checkpoint/restart
                        You can checkpoint your entire job step, and allow a job step to restart from the
                        last checkpoint.

                        LoadLeveler has the ability to checkpoint your entire job step, and to allow a job
                        step to restart from the last checkpoint. When a job step is checkpointed, the entire
                        state of each process of that job step is saved by the operating system. On AIX, this
                        checkpoint capability is built in to the base operating system.

                        Use the information in Table 43 on page 191 to correctly configure your job for
                        checkpointing.




Table 43. Checkpoint configurations

To specify that your job is checkpointable:
v Add either one of the following two options to your job command file:
  1. checkpoint = yes
     This enables your job to checkpoint in any of the following ways:
     – The application can initiate the checkpoint. This is only available on AIX.
     – Checkpoint from a program which invokes the ll_ckpt API.
     – Checkpoint using the llckpt command.
     – As the result of a flush command.
  OR
  2. checkpoint = interval
     This enables your job to checkpoint in any of the following ways:
     – The application can initiate the checkpoint. This is only available on AIX.
     – Checkpoint from a program which invokes the ll_ckpt API.
     – Checkpoint using the llckpt command.
     – Checkpoint automatically taken by LoadLeveler.
     – As the result of a flush command.
v If you would like your job to checkpoint itself, use the API ll_init_ckpt in
  your serial application, or mpc_init_ckpt for parallel jobs, to cause the
  checkpoint to occur. This is only available on AIX.

To specify that your job step's executable is to be copied to the execute node:
v Add the ckpt_execute_dir keyword to the job command file.

To specify that LoadLeveler automatically checkpoints your job at preset
intervals:
1. Add the following option to your job command file:
   checkpoint = interval
   This enables your job to checkpoint in any of the following ways:
   v Checkpoint automatically at preset intervals
   v Checkpoint initiated from user application. This is only available on AIX.
   v Checkpoint from a program which invokes the ll_ckpt API
   v Checkpoint using the llckpt command
   v As the result of a flush command
2. The system administrators must set the following two keywords in the
   configuration file to specify how often LoadLeveler should take a checkpoint
   of the job:
   MIN_CKPT_INTERVAL = number
      Where number specifies the initial period, in seconds, between checkpoints
      taken for running jobs.
   MAX_CKPT_INTERVAL = number
      Where number specifies the maximum period, in seconds, between checkpoints
      taken for running jobs.

The time between checkpoints is increased after each checkpoint within these
limits, as follows:
v The first checkpoint is taken after a period of time equal to
  MIN_CKPT_INTERVAL has passed.
v The second checkpoint is taken after LoadLeveler waits twice as long
  (MIN_CKPT_INTERVAL X 2).
v The third checkpoint is taken after LoadLeveler waits twice as long again
  (MIN_CKPT_INTERVAL X 4).
LoadLeveler continues to double this period until the value of
MAX_CKPT_INTERVAL has been reached, where it stays for the remainder of the
job.

The defaults are a minimum value of 900 (15 minutes) and a maximum value of
7200 (2 hours).

You can set these keyword values globally in the global configuration file so
that all machines in the cluster have the same value, or you can specify a
different value for each machine by modifying the local configuration files.

To specify that your job will not be checkpointed:
v Add the following option to your job command file:
  checkpoint = no
  This disables checkpointing.

To specify that your job has successfully checkpointed and terminated, the job
has left the LoadLeveler job queue, and you want LoadLeveler to restart your
executable from an existing checkpoint file:
1. Add the following option to your job command file:
   restart_from_ckpt = yes
2. On AIX, specify the name of the checkpoint file by setting the following
   job command file keywords to specify the directory and file name of the
   checkpoint file to be used:
   v ckpt_dir
   v ckpt_file
When the job command file is submitted, a new job will be started that uses
the specified checkpoint file to restart the previously checkpointed job.

The job command file which was used to submit the original job should be used
to restart from checkpoint. The only modifications to this file should be the
addition of restart_from_ckpt = yes and ensuring that ckpt_dir and ckpt_file
point to the appropriate checkpoint file.

If your job has successfully checkpointed, and the job has been vacated but
remains on the LoadLeveler job queue, no action is required. When the job
restarts, if a checkpoint file is available, the job will be restarted from
that file. If a checkpoint file is not available upon restart, the job will be
started from the beginning.
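
For example, a minimal sketch of a job command file that enables automatic
checkpointing (the executable, directory, and file names here are
hypothetical):

   # @ job_type = serial
   # @ executable = /u/user/mysim
   # @ checkpoint = interval
   # @ ckpt_dir = /u/user/ckpt
   # @ ckpt_file = mysim.ckpt
   # @ output = mysim.out
   # @ error = mysim.err
   # @ queue

To restart from the saved checkpoint after the job has left the queue,
resubmit the same file with one line added:

   # @ restart_from_ckpt = yes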



Preparing a job for preemption
              Depending on various configuration options, LoadLeveler may preempt your job
              so that a higher priority job step can run.

              Administrators may:
              v Configure LoadLeveler or external schedulers to preempt jobs through various
                methods.
              v Specify preemption rules for job classes.
              v Manually preempt your job using LoadLeveler interfaces.

              To ensure that your job can be resumed after preemption, set the restart keyword
              in the job command file to yes.
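
For example, to ensure that a job step can be resumed after preemption,
include the following line in the job command file:

   # @ restart = yes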

Submitting a job command file
              After building a job command file, you can submit it for processing either to a
              machine in the LoadLeveler cluster or one outside of the cluster.

              See “Querying multiple LoadLeveler clusters” on page 71 for information on
              submitting a job to a machine outside the cluster. You can submit a job command
              file either by using the GUI or the llsubmit command.

              When you submit a job, LoadLeveler assigns a job identifier and one or more step
              identifiers.

              The LoadLeveler job identifier consists of the following:



machine name
                              The name of the machine which assigned the job identifier.
                        jobid    A number given to a group of job steps that were initiated from the same
                                 job command file.

                        The LoadLeveler step identifier consists of the following:
                        job identifier
                                The job identifier.
                        stepid A number that is unique for every job step in the job you submit.

                        If a job command file contains multiple job steps, every job step will have the same
                        jobid and a unique stepid.
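
For example, if a job command file containing two job steps is submitted from
a machine named node01 (a hypothetical host name) and LoadLeveler assigns
jobid 41, the resulting step identifiers would be node01.41.0 and node01.41.1.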

                        For an example of submitting a job, see Chapter 10, “Example: Using commands to
                        build, submit, and manage jobs,” on page 235.

                        In a multicluster environment, job and step identifiers are assigned by the local
                        cluster and are retained by the job regardless of what cluster the job runs in.

             Submitting a job using a submit-only machine
                        You can submit jobs from submit-only machines.

                        Submit-only machines allow machines that do not run LoadLeveler daemons to
                        submit jobs to the cluster. You can submit a job using either the submit-only
                        version of the GUI or the llsubmit command.

                        To install submit-only LoadLeveler, follow the procedure in the TWS LoadLeveler:
                        Installation Guide.

                        In addition to allowing you to submit jobs, the submit-only feature allows you to
                        cancel and query jobs from a submit-only machine.

Working with parallel jobs
                        LoadLeveler allows you to schedule parallel batch jobs.

                        LoadLeveler allows you to schedule parallel batch jobs that have been written
                        using the following:
                        v On AIX and Linux:
                          – IBM Parallel Environment (PE)
                          – MPICH, which is an open-source, portable implementation of the
                             Message-Passing Interface Standard developed by Argonne National
                             Laboratory
                          – MPICH-GM, which is a port of MPICH on top of Myrinet GM code
v On Linux:
  – MVAPICH, which is a high performance implementation of MPI-1 over
    InfiniBand based on MPICH

Support for PE is available in this release of LoadLeveler for Linux.




Step for controlling whether LoadLeveler copies environment
    variables to all executing nodes
          You may specify that LoadLeveler is to copy, either to all executing nodes or to
          only the master executing node, the environment variables that are specified in the
          environment job command file statement for a parallel job.

          Before you begin: You need to know:
          v Whether Parallel Environment (PE) will be used to run the parallel job; if so,
            then LoadLeveler does not have to copy the application environment to the
            executing nodes.
          v How to correctly specify the env_copy keyword. For information about keyword
            syntax and other details, see the env_copy keyword description.

To specify whether LoadLeveler is to copy environment variables to only the
master node, or to all executing nodes, use the #@ env_copy keyword in the job
command file.
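
For example, to have LoadLeveler copy the environment variables only to the
master executing node, you might specify the following (a sketch; see the
env_copy keyword description for the exact set of accepted values):

   # @ env_copy = master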

    Ensuring that parallel jobs in a cluster run on the correct
    levels of PE and LoadLeveler software
If support for parallel POE jobs is required, be aware that when LoadLeveler
uses Parallel Environment for parallel job submission, PE requires the same
level of PE software to be used throughout the parallel job.

|         Different levels of PE cannot be mixed. For example, PE 5.1 supports only
|         LoadLeveler 3.5, and PE 4.3 only supports LoadLeveler 3.4.3. Therefore, a POE
|         parallel job cannot run some of its tasks on LoadLeveler 3.4.3 machines and the
|         remaining tasks on LoadLeveler 3.5 machines.

          The requirements keyword of the job command file can be used to ensure that all
          the tasks of a POE job run on compatible levels of PE and LoadLeveler software in
          a cluster. Here are three examples showing different ways this can be done:
|         1. If the following requirements statement is included in the job command file,
|             LoadLeveler’s central manager will select only 3.5 or higher machines with the
|             appropriate OpSys level for this job step.
|            # @ requirements = (LL_Version >= "3.5") && (OpSys == "AIX53")
          2. If a requirements statement such as the following is specified, the tasks of a
POE job will see a consistent environment when "hostname1" and "hostname2"
             run the same levels of PE and LoadLeveler software.
             # @ requirements = (Machine == { "hostname1" "hostname2" }) && (OpSys == "AIX53")
|         3. If the mixed cluster has been partitioned into 3.4.3 and 3.5 LoadLeveler pools,
|            then you may use a requirements statement similar to one of the two following
|            statements to select machines running the same levels of software.
|            v # @ requirements = (Pool == 35) && (OpSys == "AIX53")
|            v # @ requirements = (Pool == 343) && (OpSys == "AIX53")
|            Here, it is assumed that all the 3.4.3 machines in this mixed cluster are assigned
|            to pool 343 and all 3.5 machines are assigned to pool 35. A LoadLeveler
|            administrator can use the pool_list keyword of the machine stanza of the
|            LoadLeveler administration file to assign machines to pools.

          If a statement such as # @ executable = /bin/poe is specified in a job command
          file, and if the job is intended to be run on 3.5 machines, then it is important that
the job be submitted from a 3.5 machine. When the "executable" keyword is used,
LoadLeveler will copy the associated binary from the submitting machine and send it
to a running machine for execution. In this example, the POE program will fail if
                            the submitting and the running machines are at different software levels. In a
                            mixed cluster, this problem can be circumvented by not using the executable
                            keyword in the job command file. By omitting this keyword, the job command file
                            itself is the shell script that will be executed. If this script invokes a local version of
                            the POE binary then there is no compatibility problem at run time.
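
For example, a minimal sketch of this approach, in which the job command file
itself is the script and invokes the locally installed POE binary (the node
count, class, and program path here are hypothetical):

   # @ job_type = parallel
   # @ node = 2
   # @ class = POE
   # @ queue
   /usr/bin/poe /u/user/my_parallel_program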

                 Task-assignment considerations
                            You can use the keywords to specify how LoadLeveler assigns tasks to nodes.

                            You can use the keywords listed in Table 44 to specify how LoadLeveler assigns
                            tasks to nodes. With the exception of unlimited blocking, each of these methods
prioritizes machines in an order based on their MACHPRIO expressions. Some
task assignment keywords can be used in combination; others are mutually
exclusive.
| Table 44. Valid combinations of task assignment keywords
|
| The following combinations of task assignment keywords are valid:
| v total_tasks with node = <number>
| v total_tasks with blocking
| v tasks_per_node with node = <min, max>
| v tasks_per_node with node = <number>
| v task_geometry by itself

                            The following examples show how each allocation method works. For each
example, consider a 3-node SP with machines named "N1", "N2", and "N3". The
machines’ order of priority, according to the values of their MACHPRIO
expressions, is: N1, N2, N3. N1 has 4 initiators available, N2 has 6, and N3 has 8.

                            node and total_tasks
                            When you specify the node keyword with the total_tasks keyword, the assignment
                            function will allocate all of the tasks in the job step evenly among however many
                            nodes you have specified.

                            If the number of total_tasks is not evenly divisible by the number of nodes, then
                            the assignment function will assign any larger groups to the first nodes on the list
                            that can accept them. In this example, 14 tasks must be allocated among 3 nodes:
                            # @ node=3
                            # @ total_tasks=14

                            Table 45 shows the machine, available initiators, and assigned tasks:
                            Table 45. node and total_tasks
                            Machine                          Available Initiators            Assigned Tasks
                            N1                               4                               4
                            N2                               6                               5
                            N3                               8                               5


The assignment function divides the 14 tasks into groups of 5, 5, and 4, and begins
at the top of the list to assign the first group of 5. The assignment function starts
at N1 but, because there are only 4 available initiators, cannot assign a block of 5
tasks. Instead, the function moves down the list and assigns the two groups of 5 to
N2 and N3; the assignment function then goes back and assigns the group of 4
tasks to N1.

node and tasks_per_node
When you specify the node keyword with the tasks_per_node keyword, the
assignment function will assign tasks in groups of the specified value among the
specified number of nodes.
# @ node = 3
# @ tasks_per_node = 4
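
In this example, the assignment function assigns 4 tasks to each of the 3
nodes, for a total of 12 tasks.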

blocking
When you specify blocking, tasks are allocated to machines in groups (blocks) of
the specified number (blocking factor).

The assignment function will assign one block at a time to the machine which is
next in the order of priority until all of the tasks have been assigned. If the total
number of tasks is not evenly divisible by the blocking factor, the remainder of
the tasks is allocated to a single node. The blocking keyword must be specified with
the total_tasks keyword. For example:
# @ blocking = 4
# @ total_tasks = 17

Where blocking specifies that a job’s tasks will be assigned in blocks, and 4
designates the size of the blocks. Table 46 shows how a blocking factor of 4 would
work with 17 tasks:
Table 46. Blocking
Machine                      Available Initiators            Assigned Tasks
N1                           4                               4
N2                           6                               5
N3                           8                               8

The assignment function first determines that there will be 4 blocks of 4 tasks, with
a remainder of one task. Therefore, the function will allocate the remainder with
the first block that it can. N1 gets a block of four tasks, N2 gets a block, plus the
remainder, then N3 gets a block. The assignment function begins again at the top
of the priority list, and N3 is the only node with enough initiators available, so N3
ends up with the last block.

unlimited blocking
When you specify unlimited blocking, the assignment function will allocate as
many tasks as possible to each node; the function prioritizes nodes primarily by
how many initiators each node has available, and secondarily on their MACHPRIO
expressions.

This method allows you to allocate tasks among as few nodes as possible. To
specify unlimited blocking, specify "unlimited" as the value for the blocking
keyword. The total_tasks keyword must also be specified with unlimited blocking.
For example:
# @ blocking = unlimited
# @ total_tasks = 17

Table 47 on page 198 lists the machine, available initiators, and assigned tasks for
unlimited blocking:

Table 47. Unlimited blocking
                        Machine                        Available Initiators       Assigned Tasks
                        N3                             8                          8
                        N2                             6                          6
                        N1                             4                          3

The assignment function begins with N3 (because N3 has the most initiators
available) and assigns 8 tasks; N2 takes 6, and N1 takes the remaining 3.

                        task_geometry
                        The task_geometry keyword allows you to specify which tasks run together on the
                        same machines, although you cannot specify which machines.

                        In this example, the task_geometry keyword groups 7 tasks to run on 3 nodes:
                        # @ task_geometry = {(5,2)(1,3)(4,6,0)}

                        The entire task_geometry expression must be enclosed within braces. The task IDs
for each node must be enclosed within parentheses, and must be separated by
                        commas. The entire range of task IDs that you specify must begin with zero, and
                        must end with the task ID which is one less than the total number of tasks. You
                        can specify the task IDs in any order, but you cannot skip numbers (the range of
                        task IDs must be complete). Commas may only appear between task IDs, and
                        spaces may only appear between nodes and task IDs.
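
In the example above, tasks 5 and 2 run together on one node, tasks 1 and 3 on
a second node, and tasks 4, 6, and 0 on a third node.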

             Submitting jobs that use striping
                        When communication between parallel tasks occurs only over a single device such
                        as en0, the application and the device are gated by each other.

                        The device must wait for the application to fill a communication buffer before it
                        transmits the buffer and the application must wait for the device to transmit and
                        empty the buffer before it can refill the buffer. Thus the application and the device
                        must wait for each other and this wastes time.

                        The technique of striping refers to using two or more communication paths to
                        implement a single communication path as perceived by the application. As the
                        application sends data, it fills up a buffer on one device. As that buffer is
                        transmitted over the first device, the application’s data begins filling up a second
                        buffer and the application perceives no delay in being able to write. When the
                        second buffer is full, it begins transmission over the second device and the
                        application moves on to the next device. When all devices have been used, the
                        application returns to the first device. Much, if not all of the buffer on the first
                        device has been transmitted while the application wrote to the buffers on the other
                        devices so the application waits for a minimal amount of time or possibly does not
                        wait at all.

                        LoadLeveler supports striping in two ways. When multiple switch planes or
                        networks are present, striping over them is indicated by requesting sn_all
                        (multiple networks).

                        If multiple adapters are present on the same network and the communication
                        subsystem, such as LAPI, supports striping over multiple adapters on the same
                        network, specifying the instances keyword on the network statement requests
                        striping over adapters on the same network. The instances keyword specifies the
                        number of adapters on a single network to stripe on. It is possible to stripe over
multiple networks and over multiple adapters on each network by specifying both
sn_all and a value for instances greater than one. For HPS adapters, only
machines that are connected to both networks are considered for sn_all jobs.
v User space striping: When sn_all is specified on a network statement with US
  mode, LoadLeveler commits an equivalent set of adapter resources (adapter
  windows and memory) on each of the networks present in the system to the job
  on each node where the job runs. The communication subsystem is initialized to
  indicate that it should use the user space communication protocol on all the
  available switch adapters to service communication requests on behalf of the
  application.
v IP striping: When the sn_all device is specified on a network statement with the
  IP mode, LoadLeveler attempts to locate the striped IP address associated with
  the switch adapters, known as the multi-link address. If it is successful, it passes
  the multi-link address to POE for use. If multi-link addresses are not available,
  LoadLeveler instructs POE to use the IP address of one of the switch adapters.
  The IP address that is used is different each time a choice has to be made in an
  attempt to balance the adapter use. Multi-link addresses must be configured on
  the system prior to running LoadLeveler and they are specified with the
  multilink_address keyword on the switch adapter stanza in the administration
  file. If a multi-link address is specified for a node, LoadLeveler assigns the
  multi-link address and multi-link IP name to the striping adapter on that node.
  If a multi-link address is not present on a node, the sn_all adapter associated
  with the node will not have an IP address or IP name. If not all of the nodes of
  a system have multi-link addresses but some do, LoadLeveler will only dispatch
  jobs that request IP striping to nodes that have multi-link addresses.
  Jobs that request striping (both user space and IP) can be submitted to nodes
  with only one switch adapter. In that situation, the result is the same as if the
  job requested no striping.

  Note: When configured, a multi-link address is associated with the virtual ml0
          device. The IP address of this device is the multi-link address. The
          llextRPD program will create a stanza for the ml0 device that will appear
          similar to Ethernet or token ring adapter stanzas except that it will
          include the multilink_list keyword that lists the adapters it performs
          striping over. As with any other device with an IP address, the ml0 device
          can be requested in IP mode on the network statement. Doing so would
          yield a comparable effect to requesting sn_all IP except that no checking
          would be performed by LoadLeveler to ensure the associated adapters are
          actually working. Thus it would be possible to dispatch a job that
          requested communication over ml0 only to have the job fail because the
          switch adapters that ml0 stripes over were down.
v Striping over one network: If the instances keyword is specified on a network
  statement with a value greater than one, LoadLeveler allocates multiple sets of
  resources for the protocol using as many sets as the instances keyword
  specified. For User Space jobs, these sets are adapter windows and memory. For
  IP jobs, these sets are IP addresses. If multiple adapters exist on each node on
  the same network, then these sets of adapter resources will be distributed among
  all the available adapters on the same network. Even though LoadLeveler will
  allocate resources to support striping over a single network, the communication
  subsystem must be capable of exploiting these resources in order for them to be
  used.

Understanding striping over multiple networks
Striping over multiple networks involves establishing a communication path using
one or more of the available communication networks or switch fabrics.

How those paths are established depends on the network adapter that is present.
                        For the SP Switch2 family of adapters, it is not necessary to acquire communication
                        paths among all tasks on all fabrics as long as there is at least one fabric over
                        which all tasks can communicate. However, each adapter on a machine, if it is
                        available, must use exactly the same adapter resources (window and memory
                        amount) as the other adapters on that machine. Switch Network Interface for HPS
                        adapters are not required to use exactly the same resources on each network, but
                        in order for a machine to be selected, there must be an available communication
                        path on all networks.


Figure 24. Striping over multiple networks. (The figure shows four nodes, each
with Adapter A connected to Switch Network A and Adapter B connected to Switch
Network B. The Network A adapters on Node 1 and Node 4 and the Network B
adapter on Node 3 are at fault.)

                        Consider these sample scenarios using the network configuration as shown in
                        Figure 24 where the adapters are from the SP Switch2 family:
                        v If a three node job requests striping over networks, it will be dispatched to Node
                          1, Node 2 and Node 4 where it can communicate on Network B as long as the
                          adapters on each machine have a common window free and sufficient memory
                          available. It cannot run on Node 3 because that node only has a common
                          communication path with Node 2, namely Network A.
v If a three node job does not request striping, it will not be run because there are
  not enough adapters connected to Network A to run the job. Notice that the
  adapter connected to Network A on Node 1 and the adapter connected to
  Network A on Node 4 are both at fault. SP Switch2 family adapters can only use
  the adapter connected to Network A for non-striped communication.




v If a three node job requests striped IP and some but not all of the nodes have
  multi-linked addresses, the job will only be dispatched to the nodes that have
  the multi-link addresses.

Consider these sample scenarios using the network configuration as shown in
Figure 24 on page 200 where the adapters are Switch Network Interface for HPS
adapters:
v If a three node job requests striping over networks, it will not be dispatched
  because there are not three nodes that have active connections to both networks.
v If a three node job does not request striping, it can be run on Node 1, Node 2,
  and Node 4 because they have an active connection to network B.
v If a three node job requests striped IP and some but not all of the nodes have
  multi-linked addresses, the job will only be dispatched to the nodes that have
  the multi-link addresses.

Note that for all adapter types, adapters are allocated to a step that requests
striping based on what the node knows is the available set of networks or fabrics.
LoadLeveler expects each node to have the same knowledge about available
networks. If this is not true, it is possible for tasks of a step to be assigned
adapters which cannot communicate with tasks on other nodes.

Similarly, LoadLeveler expects all adapters that are identified as being on the same
Network ID or fabric ID to be able to communicate with each other. If this is not
true, such as when LoadLeveler operates with multiple, independent sets of
networks, other attributes of the Step, such as the requirements expression, must
be used to ensure that only nodes from a single network set are considered for the
step.

As you can see from these scenarios, LoadLeveler will find enough nodes on the
same communication path to run the job. If enough nodes connected to a common
communication path cannot be found, no communication can take place and the
job will not run.

Understanding striping over a single network
Striping over a single network is only supported by Switch Network Interface for
HPS adapters.

Figure 25 on page 202 shows a network configuration where the adapters support
striping over a single network.




Figure 25. Striping over a single network. (The figure shows three nodes, each
with Adapter A and Adapter B connected to the same Switch Network 0. Concentric
ovals within the network represent separate communication paths, labeled
instance 0, instance 1, and instance 2; on Node 3, a fault keeps Adapter B from
connecting to the network.)

                        Both Adapter A and Adapter B on a node are connected to Network 0. The entire
                        oval represents the physical network and the concentric ovals (shaded differently)
                        represent the separate communication paths created for a job by the instances
                        keyword on the network statement. In this case a three node job requests two
                        instances for communication. On Node 1, adapter A is used for instance 0 and
                        adapter B is used for instance 1. There is no requirement to use the same adapter
                        for the same instance so on Node 2, adapter B was used for instance 0 and adapter
                        A for instance 1.

                        On Node 3, where a fault is keeping adapter B from connecting to the network,
                        adapter A is used for both instance 0 and instance 1 and Node 3 is available for
                        the job to use.

                        The network itself does not impose any limitation on the total number of
                        communication paths that can be active at a given time for either a single job or all
                        the jobs using the network. As long as nodes with adapter resources are available,
                        additional communication paths can be created.

                        Examples: Requesting striping in network statements
                        You request that a job be run using striping with the network statement in your
                        job command file.

The default when instances is not specified for a job in the network statement
is controlled by the class stanza keyword max_protocol_instances for sn_all.
For more information on the network statement and the max_protocol_instances
keyword, see the keyword descriptions in
                        “Job command file keyword descriptions” on page 359.

                        Shown here are examples of IP and user space network modes:
                        v Example 1: Requesting striping using IP mode
                          To submit a job using IP striping, your network statement would look like this:

network.MPI = sn_all,,IP
      v Example 2: Requesting striping using user space mode
        To submit a job using user space striping, your network statement would look
        like this:
        network.MPI = sn_all,,US
      v Example 3: Requesting striping over a single network
To request IP striping over multiple adapters on a single network, the network
        statement would look like this:
        network.MPI = sn_single,,IP,,instances=2

        If the nodes on which the job runs have two or more adapters on the same
        network, two different IP addresses will be allocated to each task for MPI
        communication. If only one adapter exists per network, the same IP address will
        be used twice for each task for MPI communication.
      v Example 4: Requesting striping over multiple networks and multiple adapters
        on the same network
        To submit a user space job that will stripe MPI communication over multiple
        adapters on all networks present in the system the network statement would
        look like this:
        network.MPI = sn_all,,US,,instances=2

        If, on a node where the job runs, there are two adapters on each of the two
        networks, one adapter window would be allocated from each adapter for MPI
        communication by the job. If only one network were present with two adapters,
        one adapter window from each of the two adapters would be used. If two
        networks were present but each only had one adapter on it, two adapter
        windows from each adapter would be used to satisfy the request for two
        instances.

Running interactive POE jobs
POE will accept LoadLeveler job command files.

      However, you can still set the following environment variables to define specific
      LoadLeveler job attributes before running an interactive POE job:
      LOADL_ACCOUNT_NO
           The account number associated with the job.
      LOADL_INTERACTIVE_CLASS
           The class to which the job is assigned.
      MP_TASK_AFFINITY
           The affinity preferences requested for the job.

      For information on other POE environment variables, see IBM Parallel Environment
      for AIX and Linux: Operation and Use, Volume 1.

For an interactive POE job, LoadLeveler does not start the POE process; therefore,
LoadLeveler has no control over the process environment or resource limits.

      You also may run interactive POE jobs under a reservation. For additional details
      about reservations and submitting jobs to run under them, see “Working with
      reservations” on page 213.

      Interactive POE jobs cannot be submitted to a remote cluster.
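
For example, you might set these variables in the shell before invoking POE
interactively (the class name, affinity value, and program name here are
illustrative only):

   export LOADL_INTERACTIVE_CLASS=inter_class
   export MP_TASK_AFFINITY=core
   poe ./my_program -procs 4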

Running MPICH, MVAPICH, and MPICH-GM jobs
| LoadLeveler for AIX and LoadLeveler for Linux support three open-source
| implementations of the Message-Passing Interface (MPI).

                            MPICH is an open-source, portable implementation of the MPI Standard
                            developed by Argonne National Laboratory. It contains a complete implementation
                            of version 1.2 of the MPI Standard and also significant parts of MPI-2, particularly
                            in the area of parallel I/O. MPICH, MVAPICH, and MPICH-GM are the three MPI
|                           implementations supported by LoadLeveler for AIX and LoadLeveler for Linux:
                            v Additional documentation for MPICH is available from the Argonne National
                               Laboratory web site at:
                               http://www-unix.mcs.anl.gov/mpi/mpich1/
                            v MVAPICH is a high performance implementation of MPI-1 over InfiniBand
                               based on MPICH. Additional documentation for MVAPICH is available at the
                               Ohio State University Web site at:
                               http://mvapich.cse.ohio-state.edu/
                            v MPICH-GM is a port of MPICH on top of GM (ch_gm). GM is a low-level
                              message-passing system for Myrinet Networks. Additional documentation for
                              MPICH-GM is available from the Myrinet web site at:
                               http://www.myri.com/scs/

For MPICH, MVAPICH, or MPICH-GM, LoadLeveler allocates the machines to run the
parallel job and starts the implementation-specific script as the master task.
Some options of the implementation-specific scripts might not be required, or
are not supported, when used with LoadLeveler.

                            The following standard mpirun script options are not supported:
                            -map <list>
                               The mpirun script can either take a machinefile or a mapping of the machines
                               in which to run the mpirun job. If both the machinefile and map are specified,
                               then the map list overrides the machinefile. Because we want LoadLeveler to
                               decide which nodes to run on, use the machinefile specified by the
                               environment variable LOADL_HOSTFILE. Specifying a mapping of the host
                               name is not supported.
                            -allcpus
                                 This option is only supported when the -machinefile option is used. The
                                 mpirun script will run the job using all machines specified in the machine file,
                                 without the need to specify the -np option. Without specifying machinefile,
                                 the mpirun script will look in the default machines <arch> file to find the
                                 machines on which to run the job. The machines defined in the default file
                                 might not match what LoadLeveler has selected, which will cause the job to be
                                 removed.
                            -exclude <list>
                                This option is not supported because if you specified a machine in the exclude
                                list that has already been scheduled by LoadLeveler to run the job, the job will
                                be removed.
                            -dbg
                               This option might be used to select a debugger. This option is used to select a
                               debugger to be used with the mpirun script. LoadLeveler currently does not
                               support running interactive MPICH jobs, so starting mpirun jobs under a
                               debugger is not supported.

-ksq
    This option keeps the send queue. This is useful if you expect later to attach
    totalview to the running (or deadlocked) job, and want to see the send queues.
    This option is used for debugging purposes when attaching the mpirun job to
    totalview. Since we do not support running debuggers under LoadLeveler
    MPICH job management, this option is not supported.
-machinedir <directory>
    This option looks for the machine files in the indicated directory. LoadLeveler
    will create a machinefile that contains the host name for each task in the
    mpirun job. The environment variable LOADL_HOSTFILE contains the full
    path to the machinefile. A different machinefile is created per job and stored
    in the LoadLeveler execute directory. Because there might be multiple jobs
    running at one time, we do not want the mpirun script to choose any file in
    the execute directory because it might not be the correct file that the central
manager has assigned to the job step. This option is therefore not supported;
use the -machinefile option instead.
v When using MPICH, the mpirun script is run on the first machine allocated to
  the job. The mpirun script starts the actual execution of the parallel tasks on the
  other nodes included in the LoadLeveler cluster using llspawn.stdio as
  RSHCOMMAND.
The following option of MPICH's mpirun script is not supported:
  -nolocal
      This option specifies not to run on the local machine. The default behavior
      of MPICH (p4) is that the first MPI process is always spawned on the
      machine which mpirun has invoked. The -nolocal option disables the
      default behavior and does not run the MPI process on the local node. Under
      LoadLeveler’s MPICH Job management, it is required that at least one task
      is run on the local node, so the -nolocal option should not be used.
v When using MVAPICH, the mpirun_rsh command is run on the first machine
  allocated to the job as master task. The mpirun_rsh command starts the actual
  execution of parallel tasks on the other nodes included in the LoadLeveler
  cluster using llspawn as RSHCOMMAND.
The following options of MVAPICH's mpirun_rsh command are not supported
when used with LoadLeveler:
  -rsh
         Specifies to use rsh for connecting.
  -ssh
      Specifies to use ssh for connecting. The -rsh and -ssh options are supported,
      but the behavior has been changed to run mpirun_rsh jobs under
      LoadLeveler MPICH job manager. Replace the -rsh and -ssh commands with
      llspawn before compiling mpirun_rsh. Even if you select -rsh and -ssh, the
      llspawn command is actually used in place of -rsh and -ssh at runtime.
  -xterm
      Runs remote processes under xterm. This option starts an xterm window for
      each task in the mpirun job and runs the remote shell with the application
      inside the xterm window. This will not work under LoadLeveler because the
      llspawn command replaces the remote shell (rsh or ssh) and llspawn is not
      kept alive to the end of the application process.
  -debug
      Runs each process under the control of gdb. This option is used to select a
      debugger to be used with mpirun jobs. LoadLeveler currently does not
      support running interactive MPICH jobs so starting mpirun jobs under a
debugger is not supported. This option also requires xterm to be working
                                properly as it opens gdb under an xterm window. Since we do not support
                                the -xterm option, the -debug option is also not supported.
                          h1 h2....
                              Specifies the names of hosts where processes should run. The mpirun_rsh
                              script can either take a host file or read in the names of the hosts, h1 h2 and
                              so on, in which to run the mpirun job. If both host file and list of machines
                              are specified in the mpirun_rsh arguments, mpirun_rsh will have an error
                              parsing the arguments. Because we want LoadLeveler to decide which nodes
                              to run on, you should use the host list specified by the environment variable
                              LOADL_HOSTFILE. Specifying the names of the hosts is not supported.
                        v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine
                          allocated to the job as master task. The mpirun.ch_gm script starts the actual
                          execution of the parallel tasks on the other nodes included in the LoadLeveler
                          cluster using the llspawn command as RSHCOMMAND.
The following options of MPICH-GM's mpirun script are not supported when
used with LoadLeveler:
                           --gm-kill <n>
                               This is an option that allows you to kill all remaining processes <n> seconds
                               after the first one dies or exits. Do not specify this option when running the
                               application under LoadLeveler, because LoadLeveler will handle the cleanup
                               of the tasks.
                           --gm-tree-spawn
                               This is an option that uses a two-level spawn tree to launch the processes in
                               an effort to reduce the load on any particular host. Because LoadLeveler is
                               providing its own scalable method for spawning the application tasks from
                               the master host, using the llspawn command, spawning processes in a
                               tree-like fashion is not supported.
                           -totalview
                               This option is used to select a totalview debugging session to be used with
                               the mpirun script. LoadLeveler currently does not support running
                               interactive MPICH jobs, so starting mpirun jobs under a debugger is not
                               supported.
-r   This option is optional for MPICH-GM; it forces the removal of the
     shared memory files. Because this option is not required, it is not
     supported. If you specify this option, it will be ignored.
                           -ddt
                               This option is used to select a DDT debugging session to be used with the
                               mpirun script. LoadLeveler currently does not support running interactive
                               MPICH jobs, so starting mpirun jobs under a debugger is not supported.

                        Sample programs are available:
                        v See “MPICH sample job command file” on page 208 for a sample MPICH job
                          command file.
                        v See “MPICH-GM sample job command file” on page 209 for a sample
                          MPICH-GM job command file.
                        v See “MVAPICH sample job command file” on page 211 for a sample MVAPICH
                          job command file.
                        v The LoadLeveler samples directory also contains sample files:
                          – On AIX, use directory /usr/lpp/LoadL/full/samples/llmpich
                          – On Linux, use directory /opt/ibmll/LoadL/full/samples/llmpich


These sample files include:
          – ivp.c: A simple MPI application that you may run as an MPICH, MVAPICH,
            or MPICH-GM job.
          – Job command files to run the ivp.c program as a batch job:
            - For MPICH: mpich_ivp.cmd
            - For MPICH-GM: mpich_gm_ivp.cmd
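For example, assuming the mpicc compiler wrapper from the MPICH installation used in these samples (/opt/mpich/bin) and the AIX samples directory listed above, you might build and submit the IVP as follows; the output file name is illustrative:

  # Copy and compile the sample MPI program
  cp /usr/lpp/LoadL/full/samples/llmpich/ivp.c .
  /opt/mpich/bin/mpicc -o ivp ivp.c

  # Submit the matching job command file (first edit it to point to your ivp binary)
  llsubmit mpich_ivp.cmd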

Examples: Building parallel job command files
      This topic contains sample job command files for several parallel environments.

      This topic contains sample job command files for the following parallel
      environments:
      v IBM AIX Parallel Operating Environment (POE)
      v MPICH
      v MPICH-GM
      v MVAPICH

      POE sample job command file
      This is a sample job command file for POE.

      Figure 26 is a sample job command file for POE.


#
# @ job_type = parallel
# @ environment = COPY_ALL
# @ output = poe.out
# @ error = poe.error
# @ node = 8,10
# @ tasks_per_node = 2
# @ network.LAPI = sn_all,US,,instances=1
# @ network.MPI = sn_all,US,,instances=1
# @ wall_clock_limit = 60
# @ executable = /usr/bin/poe
# @ arguments = /u/richc/My_POE_program -euilib "us"
# @ class = POE
# @ queue

      Figure 26. POE job command file – multiple tasks per node

      Figure 26 shows the following:
      v The total number of nodes requested is a minimum of eight and a maximum of
        10 (node=8,10). Two tasks run on each node (tasks_per_node=2). Thus the total
        number of tasks can range from 16 to 20.
      v Each task of the job will run using the LAPI protocol in US mode with a switch
        adapter (network.LAPI=sn_all,US,,instances=1), and using the MPI protocol in
        US mode with a switch adapter (network.MPI=sn_all,US,,instances=1).
      v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).
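To try this sample, save it to a file and submit it with the llsubmit command; the file name here is illustrative:

  llsubmit poe_sample.cmd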

Figure 27 on page 208 is a second sample job command file for POE.




#
                        # @ job_type = parallel
                        # @ input = poe.in.1
                        # @ output = poe.out.1
                        # @ error = poe.err
                        # @ node = 2,8
                        # @ network.MPI = sn_single,shared,IP
                        # @ wall_clock_limit = 60
                        # @ class = POE
                        # @ queue
                        /usr/bin/poe /u/richc/my_POE_setup_program -infolevel 2
                        /usr/bin/poe /u/richc/my_POE_main_program -infolevel 2

                        Figure 27. POE sample job command file – invoking POE twice

                        Figure 27 shows the following:
                        v POE is invoked twice, through my_POE_setup_program and
                          my_POE_main_program.
                        v The job requests a minimum of two nodes and a maximum of eight nodes
                          (node=2,8).
                        v The job by default runs one task per node.
                        v The job uses the MPI protocol with a switch adapter in IP mode
                          (network.MPI=sn_single,shared,IP).
                        v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

                        MPICH sample job command file
                        This is a sample job command file for MPICH.

                        Figure 28 is a sample job command file for MPICH.

#!/bin/ksh
                        # LoadLeveler JCF file for running an MPICH job
                        # @ job_type = MPICH
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_test.$(cluster).$(process).out
                        # @ error = mpich_test.$(cluster).$(process).err
                        # @ queue
                        echo "------------------------------------------------------------"
                        echo LOADL_STEP_ID=$LOADL_STEP_ID
                        echo "------------------------------------------------------------"

/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile \
 $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

                        Figure 28. MPICH job command file - sample 1

                        Note: You can also specify the job_type=parallel keyword and invoke the mpirun
                              script to run an MPICH job. In that case, the mpirun script would use rsh
                              or ssh and not the llspawn command.

                        Figure 28 shows that in the following job command file statement:
/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
                        -np
                              Specifies the number of parallel processes.
                        LOADL_TOTAL_TASKS
                          Is the environment variable set by LoadLeveler with the number of parallel
                          processes of the job step.

-machinefile
   Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the file name that contains
  host names assigned to the parallel job step.
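In Figure 28, for example, node = 4 and tasks_per_node = 2, so LoadLeveler sets LOADL_TOTAL_TASKS to 8 and mpirun starts eight parallel processes on the hosts listed in the file named by LOADL_HOSTFILE.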

The following is another example of an MPICH job command file:

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 29. MPICH job command file - sample 2

Figure 29 shows the following:
v The mpirun script is specified as a value of the executable job command file
  keyword.
v The following mpirun script arguments are specified with the arguments job
  command file keyword:
  -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

  -np
        Specifies the number of parallel processes.
  LOADL_TOTAL_TASKS
    Is the environment variable set by LoadLeveler with the number of parallel
    processes of the job step.
  -machinefile
     Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MPICH-GM sample job command file
This is a sample job command file for MPICH-GM.

Figure 30 on page 210 is a sample job command file for MPICH-GM.




#! /bin/ksh
                        # LoadLeveler JCF file for running an MPICH-GM job
                        # @ job_type = MPICH
                        # @ resources = gmports(1)
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_gm_test.$(cluster).$(process).out
                        # @ error = mpich_gm_test.$(cluster).$(process).err
                        # @ queue
                        echo "------------------------------------------------------------"
                        echo LOADL_STEP_ID=$LOADL_STEP_ID
                        echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile \
$LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

                        Figure 30. MPICH-GM job command file - sample 1

Figure 30 shows the following:
v The statement # @ resources = gmports(1) specifies that each task consumes one GM port. This is how LoadLeveler limits the number of GM ports simultaneously in use on any machine. The resource name is the name you specified with schedule_by_resources in the configuration file, and each machine stanza in the administration file must define the gmports resource and specify the quantity of GM ports available on that machine (see the sketch after this list). Use the llstatus -R command to confirm the names and values of the configured and available consumable resources.
                        v In the following job command file statement:
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
                           /opt/mpich/bin/mpirun.ch_gm
                               Specifies the location of the mpirun.ch_gm script shipped with the
                               MPICH-GM implementation that runs the MPICH-GM application.
                           -np
                                 Specifies the number of parallel processes.
                           -machinefile
                              Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.
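The following is a minimal sketch of how the gmports resource might be configured; the keyword names are those referred to above, while the stanza label and the number of ports are illustrative:

  # Configuration file: schedule by the gmports consumable resource
  SCHEDULE_BY_RESOURCES = gmports

  # Administration file: a machine stanza advertising two GM ports
  node01: type = machine
          resources = gmports(2)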

                        Figure 31 is another sample job command file for MPICH-GM.

                        #! /bin/ksh
                        # LoadLeveler JCF file for running an MPICH-GM job
                        # @ job_type = MPICH
                        # @ resources = gmports(1)
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_gm_test.$(cluster).$(process).out
                        # @ error = mpich_gm_test.$(cluster).$(process).err
                        # @ executable = /opt/mpich/bin/mpirun.ch_gm
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
                        # @ queue

                        Figure 31. MPICH-GM job command file - sample 2

Figure 31 shows the following:
v The mpirun.ch_gm script is specified as the value of the executable job command file keyword.
v The following mpirun.ch_gm script arguments are specified with the arguments job command file keyword:
  -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

  -np
        Specifies the number of parallel processes.
  LOADL_TOTAL_TASKS
    Is the environment variable set by LoadLeveler with the number of parallel
    processes of the job step.
  -machinefile
     Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MVAPICH sample job command file
This is a sample job command file for MVAPICH.

Figure 32 is a sample job command file for MVAPICH:

#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"

/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile \
 $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 32. MVAPICH job command file - sample 1

Figure 32 shows that in the following job command file statement:
/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
-np
      Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
  Is the environment variable set by LoadLeveler with the number of parallel
  processes of the job step.
-machinefile
   Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 33 is another sample job command file for MVAPICH:




#!/bin/ksh
                        # LoadLeveler JCF file for running an MVAPICH job
                        # @ job_type = MPICH
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mvapich_test.$(cluster).$(process).out
                        # @ error = mvapich_test.$(cluster).$(process).err
                        # @ executable = /opt/mpich/bin/mpirun_rsh
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
                        # @ queue

                        Figure 33. MVAPICH job command file - sample 2

Figure 33 shows the following:
v The mpirun_rsh command is specified as the value of the executable job command file keyword.
                        v The following mpirun_rsh command arguments are specified with the
                          arguments job command file keyword:
                            -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

                            -np
                                  Specifies the number of parallel processes.
                            LOADL_TOTAL_TASKS
                              Is the environment variable set by LoadLeveler with the number of parallel
                              processes of the job step.
                            -machinefile
                               Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

             Obtaining status of parallel jobs
                        Both end users and LoadLeveler administrators can obtain status of parallel jobs in
                        the same way as they obtain status of serial jobs – either by using the llq
                        command or by viewing the Jobs window on the graphical user interface (GUI).

By issuing llq -l, or by using the Job Actions → Details selection in xloadl, users get a list of machines allocated to the parallel job. If you also need to see task instance information, use the -x option in addition to the -l option (llq -l -x). See “llq - Query job status” on page 479 for samples of output using the -x and -l options with the llq command.

             Obtaining allocated host names
                        llq -l output includes information on allocated host names.

                        Another way to obtain the allocated host names is with the
                        LOADL_PROCESSOR_LIST environment variable, which you can use from a shell
                        script in your job command file as shown in Figure 34 on page 213.

                        This example uses LOADL_PROCESSOR_LIST to perform a remote copy of a local
                        file to all of the nodes, and then invokes POE. Note that the processor list contains
                        an entry for each task running on a node. If two tasks are running on a node,
                        LOADL_PROCESSOR_LIST will contain two instances of the host name where the
                        tasks are running. The example in Figure 34 on page 213 removes any duplicate
                        entries.

Note that LOADL_PROCESSOR_LIST is set by LoadLeveler, not by the user. This environment variable is limited to 128 host names. If the value would exceed that limit, the environment variable is not set.

#!/bin/ksh
# @ output = my_POE_program.$(cluster).$(process).out
# @ error = my_POE_program.$(cluster).$(process).err
# @ class = POE
# @ job_type = parallel
# @ node = 8,12
# @ network.MPI = sn_single,shared,US
# @ queue

              tmp_file="/tmp/node_list"
              rm -f $tmp_file

              # Copy each entry in the list to a new line in a file so
              # that duplicate entries can be removed.
              for node in $LOADL_PROCESSOR_LIST
                      do
                              echo $node >> $tmp_file
                      done

# Sort the file, removing duplicate entries, and save the list in a variable
nodelist=$(sort -u $tmp_file)

              for node in $nodelist
                      do
                              rcp localfile $node:/home/userid
                      done

              rm -f $tmp_file


              /usr/bin/poe /home/userid/my_POE_program

              Figure 34. Using LOADL_PROCESSOR_LIST in a shell script


Working with reservations
              Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
              reservations, which specify a time period during which specific node resources are
              reserved for use by particular users or groups.

              Use Table 48 to find information about working with reservations.
              Table 48. Roadmap of tasks for reservation owners and users
              Subtask                              Associated instructions (see . . . )
              Learn how reservations work in the   v “Overview of reservations” on page 25
              LoadLeveler environment
                                                   v “Understanding the reservation life cycle” on page
                                                     214
              Creating new reservations            “Creating new reservations” on page 216
              Managing jobs that run under a       v “Submitting jobs to run under a reservation” on
              reservation                            page 218
                                                   v “Removing bound jobs from the reservation” on
                                                     page 220
              Managing existing reservations       v “Querying existing reservations” on page 221
                                                   v “Modifying existing reservations” on page 221
                                                   v “Canceling existing reservations” on page 222



                            Using the LoadLeveler interfaces for   v Chapter 16, “Commands,” on page 411
                            reservations                           v “Reservation API” on page 643



                 Understanding the reservation life cycle
                            From the time at which LoadLeveler creates a reservation through the time the
                            reservation ends or is canceled, a reservation goes through various states, which
                            are indicated in command listings and other displays or output.

                            Understanding these states is important because the current state of a reservation
                            dictates what actions you can take; for example, if you want to modify the start
                            time for a reservation, you may do so only while the reservation is in Waiting
                            state. Table 49 lists the possible reservation states, their abbreviations, and usage
                            notes.
                            Table 49. Reservation states, abbreviations, and usage notes
                            Reservation     Abbreviation     Usage notes
                            state           in displays /
                                            output
|                           Waiting         W                Reservations are in the Waiting state:
|                                                            1. When LoadLeveler first creates a reservation.
|                                                            2. After one occurrence of a recurring reservation ends
|                                                               and before the next occurrence starts.

|                                                            While the reservation is in the Waiting state:
                                                             v Only administrators and reservation owners may
                                                               modify, cancel, and add users or groups to the
                                                               reservation.
                                                             v Administrators, reservation owners, and users or groups
                                                               that are allowed to use the reservation may query it, and
                                                               submit jobs to run during the reservation period.




Setup           S               LoadLeveler changes the state of a reservation from
                                Waiting to Setup just before the start time of the
                                reservation. The actual time at which LoadLeveler places
                                the reservation in Setup state depends on the value set for
                                the RESERVATION_SETUP_TIME keyword in the
                                configuration file.

                                While the reservation is in Setup state:
                                v Only administrators and reservation owners may
                                  modify, cancel, and add users or groups to the
                                  reservation.
                                v Administrators, reservation owners, and users or groups
                                  that are allowed to use the reservation may query it, and
                                  submit jobs to run during the reservation period.

                                During this setup period, LoadLeveler:
                                v Stops scheduling unbound job steps to reserved nodes.
                                v Preempts any jobs that are still running on the nodes
                                  that are reserved through this reservation. To preempt
                                  the running jobs, LoadLeveler uses the preemption
                                  method specified through the
                                  DEFAULT_PREEMPT_METHOD keyword in the
                                  configuration file.
                                  Note: The default value for
                                  DEFAULT_PREEMPT_METHOD is SU (suspend),
                                  which is not supported in all environments, and the
                                  default value for PREEMPTION_SUPPORT is NONE. If
                                  you want preemption to take place at the start of the
                                  reservation, make sure the cluster is configured for
                                  preemption (see “Steps for configuring a scheduler to
                                  preempt jobs” on page 130 for more information).
Active          A               At the reservation start time, LoadLeveler changes the
                                reservation state from Setup to Active. It also dispatches
                                only job steps that are bound to the reservation, until the
                                reservation completes or is canceled.

                                LoadLeveler does not dispatch bound job steps that:
                                v Require certain resources, such as floating consumable
                                  resources, that are not available during the reservation
                                  period.
                                v Have expected end times that exceed the end time of the
                                  reservation. By default, LoadLeveler allows such jobs to
                                  run, but their completion is subject to resource
                                  availability. (An administrator may configure
                                  LoadLeveler to prevent such jobs from running.)
                                These bound job steps remain idle unless the required
                                resources become available.

                                While the reservation is in Active state:
                                v Only administrators and reservation owners may
                                  modify, cancel, and add users or groups to the
                                  reservation.
                                v Administrators, reservation owners, and users or groups
                                  that are allowed to use the reservation may query it, and
                                  submit jobs to run during the reservation period.


                            Active_Shared AS                At the reservation start time, LoadLeveler changes the
                                                            reservation state from Setup to Active. It also dispatches
                                                            only job steps that are bound to the reservation, unless the
                                                            reservation was created with the SHARED mode. In this case,
                                                            if reserved resources are still available after LoadLeveler
                                                            dispatches any bound job steps that are eligible to run,
                                                            LoadLeveler changes the reservation state to
                                                            Active_Shared, and begins dispatching job steps that are
                                                            not bound to the reservation. Once the reservation state
                                                            changes to Active_Shared, it remains in that state until the
                                                            reservation completes or is canceled. During this time,
                                                            LoadLeveler dispatches both bound and unbound job
                                                            steps, pending resource availability; bound job steps are
                                                            considered before unbound job steps.

                                                            The conditions under which LoadLeveler will not dispatch
                                                            bound job steps are the same as those listed in the notes
                                                            for the Active state.

                                                            The actions that administrators, reservation owners, and
                                                            users may perform are the same as those listed in the
                                                            notes for the Active state.
                            Canceled        CA              When a reservation owner, administrator, or LoadLeveler
                                                            issues a request to cancel the reservation, LoadLeveler
                                                            changes the state of a reservation to Canceled and unbinds
                                                            any job steps bound to this reservation. When the
                                                            reservation is in this state, no one can modify or submit
                                                            jobs to this reservation.
                            Complete        C               When a reservation end time is reached, LoadLeveler
                                                            changes the state of a reservation to Complete. When the
                                                            reservation is in this state, no one can modify or submit
                                                            jobs to this reservation.



                 Creating new reservations
                            You must be an authorized user or member of an authorized group to successfully
                            create a reservation. LoadLeveler administrators define authorized users by adding
                            the max_reservations keyword to the user or group stanza in the administration
                            file.

                            The max_reservations keyword setting also defines how many reservations you are
                            allowed to own. Ask your administrator whether you are authorized to create
                            reservations.

                            To be authorized to create reservations, LoadLeveler administrators also must have
                            the max_reservations keyword set in their user or group stanza.

|                           To create a reservation, use the llmkres command. Specify the start time of the
|                           reservation using the -t command option and the duration of the reservation using
|                           the -d command option. If you are creating a recurring reservation, you must use
|                           the -t option to specify the schedule for that reservation.



|   In addition to the start time and duration (or reservation schedule), you must also
|   use one of the following methods to specify how you want to select nodes for the
|   reservation.

|   Note: These methods are mutually exclusive.
    v The -n option on the llmkres command instructs LoadLeveler to reserve a
      number of nodes. LoadLeveler may select any unreserved node to satisfy a
      reservation. This command option is perhaps the easiest to use, because you
      need to know only how many nodes you want, not specific node characteristics.
      The minimum number of nodes a reservation must have is 1.
    v The -h option on the llmkres command instructs LoadLeveler to reserve specific
      nodes.
    v The -f option on the llmkres command instructs LoadLeveler to submit the
      specified job command file, and reserve appropriate nodes for the first job step
      in the job command file. Through this action, all job steps for the job are bound
      to the reservation. If the reservation request fails, LoadLeveler changes the state
      for all job steps for this job to NotQueued, and will not schedule any of those
      job steps to run.
    v The -j option on the llmkres command instructs LoadLeveler to reserve
      appropriate nodes for that job step. Through this action, the job step is bound to
      the reservation. If the reservation request fails, the job step remains in the same
      state as it was before.
v The -c option on the llmkres command instructs LoadLeveler to reserve a number of Blue Gene compute nodes (C-nodes). The -j and -f options also reserve Blue Gene resources if the job type is bluegene.
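For example, the following command requests a reservation of four nodes starting at 3:00 PM on March 10 and lasting 120 minutes. The option letters are those described above, but the start time format shown is illustrative; see the llmkres reference page for the exact syntax:

  llmkres -t 03/10 15:00 -d 120 -n 4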

    You also may define other reservation attributes, including:
    v Whether additional users or groups are allowed to use the reservation. Use the
      -U or -G command options, respectively.
    v Whether the reservation will be in one or both of these optional modes:
      – SHARED mode: When you use the -s command option, LoadLeveler allows
         reserved resources to be shared by job steps that are not associated with a
         reservation. This mode enables the efficient use of reserved resources; if the
         bound job steps do not use all of the reserved resources, LoadLeveler can
         schedule unbound job steps as well so the resources do not remain idle.
         Unless you specify this mode, however, only job steps bound to the
         reservation may use the reserved resources.
      – REMOVE_ON_IDLE mode: When you use the -i command option, LoadLeveler
         automatically cancels the reservation when all bound job steps that can run
         finish running. Using this mode is efficient because it prevents LoadLeveler
         from wasting reserved resources when no jobs are available to use them.
         Selecting this mode is especially useful for workloads that will run
         unattended.
|   v The default binding method to use when jobs are bound to the reservation. Use
|     the -m option to specify whether the soft or firm binding method should be
|     used when the binding method is not specified by the llbind command.
|     – Soft binding allows the bound job to use resources outside of the reservation.
|     – Firm binding restricts the job to the reserved resources.
|   v For a recurring reservation, when the reservation will expire. Use the -e option
|     to specify the expiration date of the recurring reservation.

    Additional rules apply to the use of these options; see “llmkres - Make a
    reservation” on page 459 for details.


|                           Alternative: Use the ll_make_reservation and the ll_init_reservation_param
|                           subroutines in a program.

                            Tips:
| v If your user ID is not authorized to create any type of reservation but you are a
    member of a group with authority to create reservations, you must use the -g
                              option to specify the name of the authorized group on the llmkres command.
                            v Only reservations in waiting and in use are counted toward the limit of allowed
                              reservations set through the max_reservations keyword. LoadLeveler does not
|                             count reservations or recurring reservations that have already ended or are in
                              the process of being canceled.
|                           v For accounting purposes, although recurring reservations have multiple
|                             instances, a recurring reservation counts as one reservation no matter how many
|                             times it may recur during its reservation period.
|                           v Although you may create more than one reservation or recurring reservation for
                              a particular node or set of nodes, only one of those reservations may be active at
                              a time. If LoadLeveler determines that the reservation you are requesting will
                              overlap with another reservation, LoadLeveler fails the create request. No
                              reservation periods for the same set of machines can overlap.

                            If the create request is successful, LoadLeveler assigns and returns to the owner a
                            unique reservation identifier, in the form host.rid.r, where:
                            host     The name of the machine which assigned the reservation identifier.
                            rid      A number assigned to the reservation by LoadLeveler.
                            r        The letter r is used to distinguish a reservation identifier from a job step
                                     identifier.

                            The following are examples of reservation identifiers:
                            c94n16.80.r
                            c94n06.1.r

                            For details about the LoadLeveler interfaces for creating reservations, see:
                            v “llmkres - Make a reservation” on page 459.
                            v “ll_make_reservation subroutine” on page 653 and “ll_init_reservation_param
                              subroutine” on page 652.

                 Submitting jobs to run under a reservation
                            LoadLeveler administrators, reservation owners, and authorized users may submit
                            jobs to run under a reservation.

You may bind both batch and interactive POE job steps to a reservation, either before a reservation starts or while it is active.

                            Before you begin:
                            v If you are a reservation owner and used the -f or -j options on the llmkres
                              command when you created the reservation, you do not have to perform the
                              steps listed in Table 50 on page 219. Those command options automatically bind
                              the job steps to the reservation. To find out whether a particular job step is
                              bound to a reservation, use the command llq -l and check the listing for a
                              reservation ID.
                            v To find out which reservation IDs you may use, check with your LoadLeveler
                              administrator, or enter the command llqres -l and check the names in the Users
                              or Groups fields (under the Modification time field) in the output listing. If your


user name or a group name to which you belong appears in these output fields,
      you are authorized to use the reservation.
    v LoadLeveler cannot guarantee that certain resources will be available during a
      reservation period. If you submit job steps that require these resources,
      LoadLeveler will bind the job steps to the reservation, but will not dispatch
      them unless the resources become available during the reservation. These
      resources include:
      – Specific nodes that were not reserved under this reservation.
      – Floating consumable resources for a cluster.
      – Resources that are not released through preemption, such as virtual memory
         and adapters.
    v Whether bound job steps are successfully dispatched depends not only on
      resource availability, but also on administration file keywords that set maximum
      numbers, including:
      – max_jobs_scheduled
      – maxidle
      – maxjobs
      – maxqueued
      If LoadLeveler determines that scheduling a bound job will exceed one or more
      of these configured limits, your job will remain idle unless conditions permit
      scheduling at a later time during the reservation period.
    Table 50. Instructions for submitting a job to run under a reservation
To bind this type of job:     Use these instructions:

Already submitted jobs        Use the llbind command.

                              Alternative: Use the ll_bind_reservation subroutine in a
                              program.

                              Result: LoadLeveler either sets the reservation ID for each
                              job step that can be bound to the reservation, or sends a
                              failure notification for the bind request.

A new job that has not        1. Specify the reservation ID through the LL_RES_ID
been submitted                   environment variable or the ll_res_id command file
                                 keyword. The ll_res_id keyword takes precedence over
                                 the LL_RES_ID environment variable.
                                 Tip: You can use the ll_res_id keyword to modify the
                                 reservation to submit to in a job command file filter.
                              2. Use the llsubmit command to submit the job.
                                 Result: If the job can be bound to the requested
                                 reservation, LoadLeveler sets the reservation ID for each
                                 job step that can be bound to the reservation. Otherwise,
                                 if the job step cannot be bound to the reservation,
                                 LoadLeveler changes the job state to NotQueued. To
                                 change the job step’s state to Idle, issue the llbind -r
                                 command.


    Use the llqres command or llq command with the -l option to check the success or
    failure of the binding request for each job step.
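For example, to bind a new job to the reservation c94n16.80.r shown earlier, you can either add the keyword to the job command file or set the environment variable before submitting; the job command file name is illustrative:

  # In the job command file:
  # @ ll_res_id = c94n16.80.r

  # Or from the shell, before submitting:
  export LL_RES_ID=c94n16.80.r
  llsubmit myjob.cmd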

|   Selecting firm or soft binding: There are two methods by which a job step can be
|   bound to a reservation: firm and soft. When a job step is firm bound to a
|   reservation, the job step can only use the reserved resources. A job step that is soft
|   bound to a reservation can be started before the reservation becomes active and
|   can use nodes that are not part of the reservation. Using soft binding is a way of
|   guaranteeing that resources will be available for the job step at a given time, but
|   allowing the job step to start earlier if there are available resources.

Which method to use is specified by the -m option of the llbind command. If
                            neither is specified by llbind, the default method specified for the reservation is
                            used. Use llqres -l and review the Binding Method field to determine which
                            method is the default for a reservation.

|                           Binding a job step to a recurring reservation: When a job step is bound to a
|                           reservation, the job step can be considered for scheduling as soon as any
|                           occurrence of the reservation is active. If you do not want the job step to run right
|                           away, but instead you want it to run in a later occurrence of the reservation, you
|                           can specify which occurrence the job step will be bound to by adding the
|                           occurrence ID to the end of the reservation ID.

|                           The format of the reservation identifier is [host.]rid[.r[.oid]].

|                           where:
|                           v host is the name of the machine that assigned the reservation identifier.
|                           v rid is the number assigned to the reservation when it was created. An rid is
|                             required.
|                           v r indicates that this is a reservation ID (r is optional if oid is not specified).
|                           v oid is the occurrence ID of a recurring reservation (oid is optional).

|                           When oid is specified, the job step will not be considered for scheduling until that
|                           occurrence of the reservation becomes active. The step will remain in Idle state
|                           during all earlier occurrences.
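For example, binding a job step to c94n16.80.r.3 ties the step to occurrence 3 of reservation c94n16.80.r; the step remains in Idle state during all earlier occurrences and is considered for scheduling only when occurrence 3 becomes active.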

|                           If a job step is bound to a recurring reservation, and the reservation occurrence’s
|                           end time is reached before the job step can be scheduled to run, the job step will
|                           be automatically bound to the next occurrence of the reservation by LoadLeveler.
|                           When the next occurrence becomes active, the job step will again be considered for
|                           scheduling.

|                           A job can be submitted with the recurring keyword set to yes in the job command
|                           file to specify that all steps of the job will be run in every occurrence of the
|                           reservation to which it is bound. When all steps of the job have completed, the
|                           entire job is requeued and all steps are bound to the next occurrence of the
|                           reservation.

                            For details about the LoadLeveler interfaces for submitting jobs under reservations,
                            see:
                            v “llbind - Bind job steps to a reservation” on page 415.
                            v “ll_bind subroutine” on page 645.
                            v “llsubmit - Submit a job” on page 531.

                 Removing bound jobs from the reservation
                            LoadLeveler administrators, reservation owners, and authorized users may use the
                            llbind command to unbind one or more existing jobs from a reservation.

|                           Alternative: Use the ll_bind_reservation subroutine in a program.

                            Result: LoadLeveler either unbinds the jobs from the reservation, or sends a failure
                            notification for the unbind request. Use the llqres or llq command to check the
                            success or failure of the remove request.




For details about the LoadLeveler interfaces for removing bound jobs from the
          reservation, see:
          v “llbind - Bind job steps to a reservation” on page 415.
          v “ll_bind subroutine” on page 645.

    Querying existing reservations
|         Any LoadLeveler administrator or user can issue the llqres and llq commands to
|         query the status of an existing reservation or recurring reservation.

          Use these commands to request specific information about reservations:
          v Various options are available to filter reservations to be displayed.
          v To show details of specific reservations, use the llqres command with the -l
            option.
          v To show job steps that are bound to specific reservations, use the llq command
            with the -R option.

          For details about:
          v Reservation attributes and llqres command syntax, see “llqres - Query a
            reservation” on page 500.
          v llq command syntax, see “llq - Query job status” on page 479.

    Modifying existing reservations
          Only administrators and reservation owners can use the llchres command to
|         modify one or more attributes of a reservation or a recurring reservation.

          Certain attributes cannot be changed after a reservation has become active. Typical
          uses for the llchres command include the following:
          v Using the command llchres -U +newuser1 newuser2 to allow additional users to
            submit jobs to the reservation.
          v If a reservation was made through the command llmkres -h free but
            LoadLeveler cannot include a particular node because it is down, you can use
            the command llchres -h +node to add the node to the reserved node list when
            that node becomes available again.
          v If a reserved node is down after the reservation becomes active, a LoadLeveler
            administrator can use:
            – The command llchres -h -node to remove that node from the reservation.
            – The command llchres -h +1 to add another node to the reservation.
|         v Extending the expiration of a recurring reservation which may be about to
|           expire. You can use llchres -e to specify a new expiration date for the
|           reservation without having to create a new reservation.
|         v Making a temporary change to the next occurrence of a recurring reservation
|           without affecting any future occurrences of that reservation. For example, you
|           can use the -o option of the llchres command to temporarily add a user (-U) or
|           additional nodes (-n). Once that occurrence ends, the next occurrence will not
|           retain the change.

|         Alternative: Use the ll_change_reservation subroutine in a program.

          For details about the LoadLeveler interfaces for modifying reservations, see:
          v “llchres - Change attributes of a reservation” on page 424.
          v “ll_change_reservation subroutine” on page 648.



Canceling existing reservations
|                           Administrators and reservation owners may use the llrmres command to cancel
|                           one or more reservations or to cancel some occurrences of a recurring reservation
|                           while leaving the remaining occurrences of that reservation unchanged in the
|                           system.

|                           The options available when canceling a reservation are:
|                           v Remove the entire reservation. All occurrences are removed and any bound job
|                             steps are automatically unbound from the reservation.
|                           v Remove a specific occurrence of the reservation. All other occurrences remain in
|                             the system and all bound job steps remain bound to the reservation.
|                           v Remove all occurrences during a specified interval. For example, a reservation
|                             may recur every day for one year, but during a one-week holiday period, the
|                             reservation is not needed. The reservation owner could cancel all of the
|                             occurrences during that one week period and all other occurrences would
|                             remain in the system and all bound job steps would remain bound to the
|                             reservation.
|                           If some occurrences are canceled and the result is that no occurrences remain, then
|                           the entire reservation is removed and all jobs are unbound from the reservation.

|                           Alternative: Use the ll_remove_reservation subroutine in a program.

                            Use the llqres command to check the success or failure of the remove request.

|                           Use the llqres -l command to see a list of canceled occurrence IDs or to note
|                           individual occurrence start times which have been omitted due to cancellation.

                            For details about the LoadLeveler interfaces for canceling reservations, see:
                            v “llrmres - Cancel a reservation” on page 508.
                            v “ll_remove_reservation subroutine” on page 658.

    Submitting jobs requesting scheduling affinity
                            You can request that a job use scheduling affinity by setting the RSET and
                            TASK_AFFINITY job command file keywords.

                            Specify RSET with a value of:
                            v RSET_MCM_AFFINITY to have LoadLeveler schedule the job to machines
                              where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY.
                            v user_defined_rset to have LoadLeveler schedule the job to machines where
                              RSET_SUPPORT is enabled with a value of RSET_USER_DEFINED;
                              user_defined_rset is the name of a valid user-defined RSet.
                            Specifying the RSET job command file keyword defaults to requesting memory
                            affinity as a requirement and adapter affinity as a preference. Scheduling affinity
                            options can be customized by using the job command file keyword
                            MCM_AFFINITY_OPTIONS. For more information on these keywords, see “Job
                            command file keyword descriptions” on page 359.
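For example, a minimal job command file fragment that requests memory affinity through the rset keyword; all other keywords are omitted for brevity:

  # @ job_type = parallel
  # @ rset = RSET_MCM_AFFINITY
  # @ queue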

                            Note: If a job specifies memory or adapter affinity scheduling as a requirement,
                                  LoadLeveler will only consider machines where RSET_SUPPORT is set to
                                  RSET_MCM_AFFINITY. If there are not enough machines satisfying the
                                  memory affinity requirements, the job will stay in the idle state.

Specify TASK_AFFINITY with a value of:
|                 v CORE(n) to have LoadLeveler schedule the job to machines where
|                   RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT
|                   and ST nodes, LoadLeveler will assign n physical CPUs to each job task.
| v CPU(n) to have LoadLeveler schedule the job to machines where
|   RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT
|   nodes, LoadLeveler will assign n logical CPUs to each job task. On ST
|   nodes, LoadLeveler will assign n physical CPUs to each job task.

|                 Specify a requirement of SMT with a value of:
|                 v Enabled to have LoadLeveler schedule the job to machines where SMT is
|                   currently enabled.
|                   Example: #@ requirements = (SMT == "Enabled")
|                 v Disabled to have LoadLeveler schedule the job to machines where SMT is
|                   currently disabled or is not supported.
|                   Example: #@ requirements = (SMT == "Disabled")

OpenMP multithreaded jobs can be submitted requesting thread-level binding, where each individual thread of an OpenMP application is bound to a separate physical core processor or logical CPU. Use the parallel_threads job command file keyword to request OpenMP thread-level binding, optionally along with the task_affinity job command file keyword.

The CPUs for the individual OpenMP threads of a task are selected based on the number of parallel threads in each task (the parallel_threads job command file keyword) and on the set of CPUs or cores assigned to the task (the task_affinity job command file keyword). The CPUs are assigned to the threads only if at least one CPU is available for each thread from the set of CPUs or cores assigned to the task. If the number of CPUs in that set is not sufficient to bind all of the threads, the job will not run.

                  This example binds 4 OpenMP parallel threads to 4 separate cores:
                  #@ task_affinity = Core(4)
                  #@ parallel_threads = 4

| Note: If you specify cpus_per_core along with your affinity request, as in:
|       #@ task_affinity = core(n)
|       #@ cpus_per_core = 1
|
|       then LoadLeveler allocates the requested number of CPUs to each task on
|       SMT nodes only. Nodes running in ST mode are not assigned to jobs
|       requesting cpus_per_core.

    Submitting and monitoring jobs in a LoadLeveler multicluster
                  There are subtasks and associated instructions for submitting and monitoring jobs
                  in a LoadLeveler multicluster.

                  Table 51 on page 224 shows the subtasks and associated instructions for submitting
                  and monitoring jobs in a LoadLeveler multicluster:




Table 51. Submitting and monitoring jobs in a LoadLeveler multicluster
                            Subtask                      Associated instructions (see . . . )
                            Prepare and submit a job     “Steps for submitting jobs in a LoadLeveler multicluster
                            in the LoadLeveler           environment”
                            multicluster
                            Display information about    v Use the llq -X cluster_name command to display information
                            a job in the LoadLeveler       about jobs on remote clusters.
                            multicluster environment
                                                         v Use llq -x -d to display the user’s job command file keyword
                                                           statements.
                                                         v Use llq -X cluster_name -l to obtain multicluster-specific
                                                           information.
                            Transfer an idle job from    Use the llmovejob command, which is described in “llmovejob
                            one cluster to another       - Move a single idle job from the local cluster to another
                            cluster                      cluster” on page 470.



                 Steps for submitting jobs in a LoadLeveler multicluster
                 environment
                            There are steps for submitting jobs in a LoadLeveler multicluster environment.

|                           In a multicluster environment, you can specify one of the following:
                            v That a job is to run on a particular cluster.
                            v That LoadLeveler is to decide which cluster is best from a list of clusters,
                              based on an administrator-defined metric. If the reserved word any is
                              specified, the job is submitted to the best of all configured clusters, based
                              on that metric.
|                           v That a job is a scale-across job, which will run across multiple clusters.

                            The following procedure explains how to prepare your job to be submitted in the
                            multicluster environment.

                            Before you begin: You need to know that:
                            v Only batch jobs are supported in the LoadLeveler multicluster environment.
                              LoadLeveler will fail any interactive jobs that you attempt to submit in a
                              multicluster environment.
                            v LoadLeveler assigns all steps of a multistep job to the same cluster.
                            v Job identifiers are assigned by the local cluster and are retained by the job
                              regardless of what cluster the job executes in.
                            v Remote jobs are subjected to the same configuration checks as locally submitted
                              jobs. Examples include account validation, class limits, include lists, and exclude
                              lists.

                            Perform the following steps to submit jobs to run in one cluster in a LoadLeveler
                            multicluster environment.
                            1. If files used by your job need to be copied between clusters, you must specify
                               the job files to be copied from the local to the remote cluster in the job
                               command file. Use the cluster_input_file and cluster_output_file keywords to
                               specify these files.
                               Rules:
                               v Any local file specified for copy must be accessible from the local gateway
                                  Schedd machines. Input files must be readable. Directories and permissions
                                  must be in place to write output files.



v Any remote file specified for copy must be accessible from the remote
          gateway Schedd machines. Directories and permissions must be in place to
          write input files. Output files must be readable when the job terminates.
       v To copy more than one file, these keywords can be specified multiple times.
       Tip: Each instance of these keywords allows you to specify a single local file
       and a single remote file. If your job requires copying multiple files (for
       example, all files in a directory), you may want to use a procedure to
       consolidate the multiple files into a single file rather than specify multiple
       cluster_file statements in the job command file. The following is an example of
       how you could consolidate input files:
       a. Use the tar command to produce a single tar file from multiple files.
       b. On the cluster_input_file keyword, specify the file that resulted from the
           tar command processing.
       c. Modify your job command file such that it uses the tar command to restore
           the multiple files from the tar file prior to invoking your application.
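        For example, a sketch of this approach (file and path names are illustrative;
        see the cluster_input_file keyword description on page 359 for its exact
        syntax). First, combine the input files on the local cluster:
           tar -cvf inputs.tar data1.in data2.in data3.in
        In the job command file, copy the tar file to the remote cluster:
           # @ cluster_input_file = /u/sam/inputs.tar, /scratch/sam/inputs.tar
        In the job script, restore the files before invoking your application:
           tar -xvf /scratch/sam/inputs.tar
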
    2. In the job command file, specify the clusters to which LoadLeveler may submit
       the job. The cluster_list keyword is a blank-delimited list of cluster names or
       the reserved word any where:
       v A single cluster name indicates that the job is to be submitted to that cluster.
       v A list of multiple cluster names indicates that the job is to be submitted to
          one of the clusters as determined by the installation exit
          CLUSTER_METRIC.
       v The reserved word any indicates that the job is to be submitted to any
          cluster defined by the installation exit CLUSTER_METRIC.
       Alternative: You can specify the clusters to which LoadLeveler can submit your
       job on the llsubmit command using the -X option.
|   3. Use the llsubmit command to submit the job.
|      Tip: You may use the -X option on the llsubmit command to specify:
|      -X {cluster_list | any}
|               Is a blank-delimited list of cluster names or the reserved word any
|               where:
|               v A single cluster name indicates that the job is to be submitted to that
|                  cluster.
|               v A list of multiple cluster names indicates that the job is to be
|                  submitted to one of the clusters as determined by the installation exit
|                  CLUSTER_METRIC.
|               v The reserved word any indicates that the job is to be submitted to
|                  any cluster defined by the installation exit CLUSTER_METRIC.

|      Note: If a remote job is submitted with a list of clusters or the reserved word
|            any and the installation exit CLUSTER_METRIC is not specified, the
|            remote job is not submitted.
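
       For example, a sketch that combines these steps (the cluster names and the job
       command file name are illustrative):
       # @ cluster_list = cluster1 cluster2
       # @ queue

       llsubmit myjob.cmd

       LoadLeveler submits myjob.cmd to whichever of cluster1 or cluster2 the
       CLUSTER_METRIC installation exit selects; llsubmit -X any myjob.cmd would
       instead allow LoadLeveler to choose among all configured clusters.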

    Perform the following steps to submit scale-across jobs to run across multiple
    clusters in a multicluster environment:
    1. In the job command file, specify the cluster_option keyword as scale_across.
       Alternative: You can submit a scale-across job using the -S option of the
       llsubmit command.
    2. You can limit which clusters can be used to run the job by using the
       cluster_list keyword to specify the limited set of clusters. For a scale-across
       job, if the cluster_list keyword is not specified or the reserved word any is
       specified in the cluster_list, all clusters may be used to run the job.
       Alternative: You can limit which clusters can be used to run the scale-across job
       using the -X option of the llsubmit command.

3. Use the llsubmit command to submit the job from any cluster in the
                               scale-across multicluster environment.
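
                             For example, a sketch of a scale-across submission (the cluster names are
                             illustrative):
                             # @ cluster_option = scale_across
                             # @ cluster_list = clusterA clusterB
                             # @ queue

                             Omitting cluster_list, or specifying any, would allow the job to run across
                             all configured clusters.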

                            The llsubmit command displays the assigned local outbound Schedd, the assigned
                            remote inbound Schedd, the scheduling cluster and the job identifier when the
                            remote job has been successfully submitted. Use the -q flag to stop these additional
                            messages from being displayed.

                            When you are done, you can use commands to display information about the
                            submitted job; for example:
                            v Use llq -l -X cluster_name -j job_id where cluster_name and job_id were displayed
                              by the llsubmit command to display information about the remote job.
                            v Use llq -l -X cluster_list to display the long listing about jobs, including
                              scheduling cluster, submitting cluster, user-requested cluster, cluster input and
                              output files.
                            v Use llq -X all to display information about all jobs in all configured clusters.
|                           v Use llq twice to display the job status for a scale-across job on all clusters where
|                             the job has been distributed. In the first command, specify the -l option to
|                             display the set of clusters where the job has been distributed (the value from the
|                             Cluster List output line). The second time you run the command, specify the -X
|                             option with the list of clusters reported from the first command. The result from
|                             that command shows the job status on the other clusters.
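
                            For example, a sketch of this two-step query (the job identifier and cluster
                            names are illustrative, and the quoting of the cluster list is an assumption):
                            llq -l clusterX.42.0
                            llq -X "clusterA clusterB" clusterX.42.0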

    Submitting and monitoring Blue Gene jobs
                            The following procedure explains how to prepare your job to be submitted to the
                            Blue Gene system.

                            The submission of Blue Gene jobs is similar to the submission of other job types.

                            Before you begin: You need to know that checkpointing Blue Gene jobs is not
                            currently supported.

                            Tip: Use the llstatus command to check if Blue Gene support is enabled and
                            whether Blue Gene is currently present. The llstatus command will display:
                            The BACKFILL scheduler with Blue Gene support is in use

                            Blue Gene is present

                            when Blue Gene support is enabled and Blue Gene is currently present.

                            Perform the following steps to submit Blue Gene jobs:
                            1. In the job command file, set the job type to Blue Gene by specifying:
                                #@job_type = bluegene
                            2. Specify the size or shape of the Blue Gene job or the Blue Gene partition in
                               which the job will run.
                                v The size of the Blue Gene job can be specified by using the bg_size job
                                  command file keyword. For more information, see the detailed description
                                  of the bg_size keyword.
                                v The shape of the Blue Gene job can be specified by using the bg_shape job
                                  command file keyword. If you require the exact shape you specified, you
                                  may wish to set the bg_rotate keyword to false. For more information, see
                                  the detailed descriptions of the bg_shape and bg_rotate keywords.

v The partition in which the Blue Gene job is run can be specified using the
          bg_partition job command file keyword. For more information, see the
          detailed description of the bg_partition keyword.
|      v The size of a Blue Gene job refers to the number of Blue Gene compute
|         nodes instead of the number of tasks running on Startd machines. The
|         following keywords cannot be used to control the size of a Blue Gene job:
|         – node
|         – tasks_per_node
|         – total_tasks
    3. Specify any other job command file keywords you require, including the
       bg_connection and bg_requirements Blue Gene job command file keywords.
       See “Job command file keyword descriptions” on page 359 for more
       information on job command file keywords.
    4. Upon completing your job command file, submit the job using the llsubmit
       command.

    If you experience a problem submitting a Blue Gene job, see “Troubleshooting in a
    Blue Gene environment” on page 717 for common questions and answers
    pertaining to operations within a Blue Gene environment.

    When you are done, you can use the llq -b command to display information about
    Blue Gene jobs in short form. For more information, see “llq - Query job status” on
    page 479.

    Example:

    The following is a sample job command file for a Blue Gene job:
    # @ job_name            = bgsample
    # @ job_type            = bluegene
    # @ comment             = "BGL Job by Size"
    # @ error               = $(job_name).err
    # @ output              = $(job_name).out
    # @ environment         = COPY_ALL;
    # @ wall_clock_limit    = 200:00,200:00
    # @ notification        = always
    # @ notify_user         = sam
    # @ bg_size             = 1024
    # @ bg_connection       = torus
    # @ class               = 2bp
    # @ queue
    /usr/bin/mpirun -exe /bgscratch/sam/com -verbose 2 -args "-o 100 -b 64 -r"




Chapter 9. Managing submitted jobs
               This is a list of the tasks and sources of additional information for managing
               LoadLeveler jobs.

               Table 52 lists the tasks and sources of additional information for managing
               LoadLeveler jobs.
               Table 52. Roadmap of user tasks for managing submitted jobs
               To learn about:                Read the following:
               Displaying information about   v “Querying the status of a job”
               a submitted job or its
                                              v “Working with machines” on page 230
               environment
                                              v “Displaying currently available resources” on page 230
                                              v “llclass - Query class information” on page 433
                                              v “llq - Query job status” on page 479
                                              v “llstatus - Query machine status” on page 512
                                              v “llsummary - Return job resource information for
                                                accounting” on page 535
               Changing the priority of a     v “Setting and changing the priority of a job” on page 230
               submitted job
                                              v “llmodify - Change attributes of a submitted job step” on
                                                page 464
               Changing the state of a        v “Placing and releasing a hold on a job” on page 232
               submitted job
                                              v “Canceling a job” on page 232
                                              v “llhold - Hold or release a submitted job” on page 454
                                              v “llcancel - Cancel a submitted job” on page 421
               Checkpointing a submitted      v “Checkpointing a job” on page 232
               job
                                              v “llckpt - Checkpoint a running job step” on page 430



Querying the status of a job
               Once you submit a job, you can query the status of the job to determine, for
               example, if it is still in the queue or if it is running.

               You also receive other job status related information such as the job ID and the
               submitting user ID. You can query the status of a LoadLeveler job either by using
               the GUI or the llq command. For an example of querying the status of a job, see
               Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on
               page 235.

               Querying the status of a job using a submit-only machine: In addition to
               allowing you to submit and cancel jobs, a submit-only machine allows you to
               query the status of jobs. You can query a job using either the submit-only version
               of the GUI or by using the llq command. For information on llq, see “llq - Query
               job status” on page 479.
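
                For example (the job identifier is illustrative):
                llq wizard.22

                Issued with no arguments, llq reports on all jobs in the queue.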




Working with machines
                         There are several types of tasks related to machines.

                        You can perform the following types of tasks related to machines:
                        v Display machine status
                          When you submit a job to a machine, the status of the machine automatically
                          appears in the Machines window on the GUI. This window displays machine
                          related information such as the names of the machines running jobs, as well as
                          the machine’s architecture and operating system. For detailed information on
                          one or more machines in the cluster, you can use the Details option on the
                          Actions pull-down menu. This will provide you with a detailed report that
                          includes information such as the machine’s state and amount of installed
                          memory.
                           For an example of displaying machine status, see Chapter 10, “Example: Using
                           commands to build, submit, and manage jobs,” on page 235.
                        v Display central manager
                          The LoadLeveler administrator designates one of the machines in the
                          LoadLeveler cluster as the central manager. When jobs are submitted to any
                          machine, the central manager is notified and decides where to schedule the jobs.
                          In addition, it keeps track of the status of machines in the cluster and jobs in the
                          system by communicating with each machine. LoadLeveler uses this information
                          to make the scheduling decisions and to respond to queries.
                           Usually, the system administrator is more concerned about the location of the
                           central manager than the typical end user is, but you may also want to
                           determine its location; for example, to browse configuration files that are
                           stored on the same machine as the central manager.
                        v Display public scheduling machines
                          Public scheduling machines are machines that participate in the scheduling of
                          LoadLeveler jobs on behalf of users at submit-only machines and users at other
                          workstations that are not running the Schedd daemon. You can find out the
                          names of all these machines in the cluster.
                          Submit-only machines allow machines that are not part of the LoadLeveler
                          cluster to submit jobs to the cluster for processing.

Displaying currently available resources
                         The LoadLeveler user can get information about currently available resources by
                         using the llstatus command with either the -F or the -R option.

                        The -F option displays a list of all of the floating resources associated with the
                        LoadLeveler cluster. The -R option lists all of the consumable resources associated
                        with all of the machines in the LoadLeveler cluster. The user can specify a hostlist
                        with the llstatus command to display only the consumable resources associated
                        with specific hosts.
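
                         For example (the host names are illustrative):
                         llstatus -F
                         llstatus -R
                         llstatus -R machine_a machine_b

                         The first command lists the floating resources of the cluster, the second lists
                         the consumable resources of every machine, and the third restricts the listing
                         to the named hosts.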

Setting and changing the priority of a job
                        LoadLeveler uses the priority of a job to determine its position among a list of all
                        jobs waiting to be dispatched.




       LoadLeveler schedules jobs based on the adjusted system priority, which takes
       into account both system priority and user priority:
      User priority
             Every job has a user priority associated with it. A job with a higher priority
             runs before a job with a lower priority (when both jobs are owned by the
             same user). You can set this priority through the user_priority keyword in
             the job command file, and modify it through the llprio command. See
             “llprio - Change the user priority of submitted job steps” on page 477 for
             more information.
      System priority
             Every job has a system priority associated with it. Administrators can set
             this priority in the configuration file using the SYSPRIO keyword
             expression. The SYSPRIO expression can contain class, group, and user
             priorities, as shown in the following example:
              SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)


              The SYSPRIO expression is evaluated by LoadLeveler to determine the
              overall system priority of a job. To determine which jobs to run first,
              LoadLeveler does the following:
              1. Assigns a system priority value when the negotiator adds the new job
                 to the queue of jobs eligible for dispatch.
              2. Orders jobs first by system priority.
              3. Assigns jobs belonging to the same user and the same class an adjusted
                 system priority, which takes all the system priorities and orders them
                 by user priority. Jobs with a higher adjusted system priority are
                 scheduled ahead of jobs with a lower adjusted system priority.
              Only administrators may modify the system priority through the llmodify
              command with the -s option. See “llmodify - Change attributes of a
              submitted job step” on page 464 for more information.
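
               For example, an administrator could change a job step’s system priority with
               a command of the form (the value and job step identifier are illustrative):
               llmodify -s 100 wizard.22.0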

Example: How does a job’s priority affect dispatching order?
      To understand how a job’s priority affects dispatching order, consider the sample
      jobs in Table 53, which lists the priorities assigned to jobs submitted by two users,
      Rich and Joe.

      Two of the jobs belong to Joe, and three belong to Rich. User Joe has two jobs (Joe1
      and Joe2) in Class A with SYSPRIOs of 9 and 8 respectively. Since Joe2 has the
      higher user priority (20), and because both of Joe’s jobs are in the same class, Joe2’s
      priority is swapped with that of Joe1 when the adjusted system priority is
      calculated. This results in Joe2 getting an adjusted system priority of 9, and Joe1
      getting an adjusted system priority of 8. Similarly, the Class A jobs belonging to
      Rich (Rich1 and Rich3) also have their priorities swapped. The priority of the job
      Rich2 does not change, since this job is in a different class (Class B).
       Table 53. How LoadLeveler handles job priorities
                                              System Priority                            Adjusted
             Job           User Priority        (SYSPRIO)              Class          System Priority
            Rich1                50                  10                  A                    6
            Joe1                 10                   9                  A                    8
            Joe2                 20                   8                  A                    9
            Rich2               100                   7                  B                    7
            Rich3                90                   6                  A                   10



Placing and releasing a hold on a job
                        You may place a hold on a job and thereby cause the job to remain in the queue
                        until you release it.

                        There are two types of holds: a user hold and a system hold. Both you and your
                        LoadLeveler administrator can place and release a user hold on a job. Only a
                        LoadLeveler administrator, however, can place and release a system hold on a job.

                        You can place a hold on a job or release the hold either by using the GUI or the
                        llhold command. For examples of holding and releasing jobs, see Chapter 10,
                        “Example: Using commands to build, submit, and manage jobs,” on page 235.

                        As a user or an administrator, you can also use the startdate keyword to place a
                        hold on a job. This keyword allows you to specify when you want to run a job.
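
                         For example, a job command file statement of the following form (the date and
                         time values are illustrative; see the startdate keyword description on page 359
                         for the exact format) keeps the job in the queue until the specified time:
                         # @ startdate = 04/25/2010 17:00:00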

Canceling a job
                        You can cancel one of your jobs that is either running or waiting to run by using
                        either the GUI or the llcancel command. You can use llcancel to cancel
                        LoadLeveler jobs, including jobs from a submit-only machine.

                        For more information about the llcancel command, see “llcancel - Cancel a
                        submitted job” on page 421.

Checkpointing a job
                         Checkpointing is a method of periodically saving the state of a job so that, if for
                         some reason the job does not complete, it can be restarted from the saved state.
                        Checkpoints can be taken either under the control of the user application or
                        external to the application.

                        On AIX only, the LoadLeveler API ll_init_ckpt is used to initiate a serial
                        checkpoint from the user application. For initiating checkpoints from within a
                        parallel application, the API mpc_init_ckpt should be used. These APIs allow the
                        writer of the application to determine at what points in the application it would be
                         appropriate to save the state of the job. To enable parallel applications to initiate
                        checkpointing, you must use the APIs provided with the Parallel Environment (PE)
                        program. For information on parallel checkpointing, see IBM Parallel Environment
                        for AIX and Linux: Operation and Use, Volume 1.

                        It is also possible to checkpoint a program running under LoadLeveler outside the
                        control of the application. There are several ways to do this:
                        v Use the llckpt command to initiate checkpoint for a specific job step. See “llckpt
                           - Checkpoint a running job step” on page 430 for more information.

v Checkpoint from a program which invokes the ll_ckpt API to initiate checkpoint
  of a specific job step. See “ll_ckpt subroutine” on page 550 for more information.
v Have LoadLeveler automatically checkpoint all running jobs that have been
  enabled for checkpoint. To enable this automatic checkpoint, specify checkpoint
  = interval in the job command file.
v As the result of an llctl flush command.
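
For example, to enable automatic checkpointing at intervals determined by
LoadLeveler, include the following statement in the job command file:
# @ checkpoint = interval

To take a checkpoint of a specific running job step on demand (the job step
identifier is illustrative):
llckpt wizard.22.0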

Note: For interactive parallel jobs, the environment variable CHECKPOINT must
      be set to yes in the environment prior to starting the parallel application or
      the job will not be enabled for checkpoint. For more information, see IBM
      Parallel Environment for AIX and Linux: MPI Programming Guide.




Chapter 10. Example: Using commands to build, submit, and
manage jobs
            The following procedure presents a series of simple tasks that a user might
            perform using commands.

            For additional information about individual commands noted in the procedure, see
            Chapter 16, “Commands,” on page 411.
            1. Build your job command file by using a text editor to create a script file. Into
               the file enter the name of the executable, other keywords designating such
               things as output locations for messages, and the necessary LoadLeveler
               statements, as shown in Figure 35:

            #   This job command file is called longjob.cmd. The
            #   executable is called longjob, the input file is longjob.in,
            #   the output file is longjob.out, and the error file is
            #   longjob.err.
            #
            # @ executable = longjob
            # @ input      = longjob.in
            # @ output     = longjob.out
            # @ error      = longjob.err

            # @ queue

            Figure 35. Building a job command file

            2. You can optionally edit the job command file you created in step 1.
            3. To submit the job command file that you created in step 1, use the llsubmit
               command:
                    llsubmit longjob.cmd
                    LoadLeveler responds by issuing a message similar to:
                    submit: The job "wizard.22" has been submitted.

               Where wizard is the name of the machine to which the job was submitted and
               22 is the job identifier (ID). You may want to record the identifier for future use
               (although you can obtain this information later if necessary).
            4. To display the status of the job you just submitted, use the llq command. This
               command returns information about all jobs in the LoadLeveler queue:
                    llq wizard.22
               Where wizard is the machine name to which you submitted the job, and 22 is
               the job ID. You can also query this job using the command llq wizard.22.0,
               where 0 is the step ID.
            5. To change the priority of a job, use the llprio command. To increase the priority
               of the job you submitted by a value of 10, enter:
                    llprio +10 wizard.22.0
                    You can change the user priority of a job that is in the queue or one that is
                    running. This only affects jobs belonging to the same user and the same class. If
                    you change the priority of a job in the queue, the job’s priority increases or
                    decreases in relation to your other jobs in the queue. If you change the priority
                    of a job that is running, it does not affect the job while it is running. It only



affects the job if the job re-enters the queue to be dispatched again. For more
                           information, see “Setting and changing the priority of a job” on page 230.
                        6. To place a temporary hold on a job in a queue, use the llhold command. This
                           command only takes effect if jobs are in the Idle or NotQueued state. To place a
                           hold on wizard.22.0, enter:
                            llhold wizard.22.0
                        7. To release the hold you placed in step 6, use the llhold command:
                            llhold -r wizard.22.0
                        8. To display the status of the machine to which you submitted a job, use the
                           llstatus command:
                            llstatus -l wizard
                        9. To cancel wizard.22.0, use the llcancel command:
                            llcancel wizard.22.0




Chapter 11. Using LoadLeveler’s GUI to build, submit, and
    manage jobs
|                   Note: This is the last release that will provide the Motif-based graphical user
|                   interface xloadl. The function available in xloadl has been frozen since TWS
|                   LoadLeveler 3.3.2.

                    You do not have to perform the tasks in the order listed. You may perform certain
                    tasks before others without any difficulty; however, some tasks must be performed
                    prior to others for succeeding tasks to work. For example, you cannot submit a job
                    if you do not have a job command file that you built using either the GUI or an
                    editor.

                    The tasks included in this topic are listed in Table 54.
                    Table 54. User tasks available through the GUI
                    Subtask                    Associated information (see...)
                    Building and submitting    v “Building jobs”
                    jobs
                                               v “Editing the job command file” on page 249
                                               v “Submitting a job command file” on page 250
                    Obtaining job status       v “Displaying and refreshing job status” on page 251
                                               v “Specifying which jobs appear in the Jobs window” on page
                                                 258
                                               v “Sorting the Jobs window” on page 252
                    Managing a submitted job v “Changing the priority of your jobs” on page 253
                                               v “Placing a job on hold” on page 253
                                               v “Releasing the hold on a job” on page 253
                                               v “Canceling a job” on page 254
                    Working with machines      v “Displaying and refreshing machine status” on page 255
                                               v “Specifying which machines appear in Machines window” on
                                                 page 259
                                               v “Sorting the Machines window” on page 257
                                               v “Finding the location of the central manager” on page 257
                                               v “Finding the location of the public scheduling machines” on
                                                 page 258
                    Saving LoadLeveler         “Saving LoadLeveler messages in a file” on page 259
                    messages in a file



    Building jobs
                    Use these instructions when building jobs.

                    From the Jobs window:
                    SELECT
                          File → Build a Job
                              The dialog box shown in Figure 36 on page 238 appears:


Figure 36. LoadLeveler build a job window

                                 Complete those fields for which you want to override what is currently
                                 specified in your skel.cmd defaults file. Sample skel.cmd and
                                 mcluster_skel.cmd files are found in the samples subdirectory of the
release directory. You can update this file to define defaults for your site,
             and then update the *skelfile resource in Xloadl to point to your new
             skel.cmd file. If you want a personal defaults file, copy skel.cmd to one of
             your directories, edit the file, and update the *skelfile resource in
             .Xdefaults. Table 55 shows the fields displayed in the Build a Job window:
Table 55. GUI fields and input
Field                  Input
Executable             Name of the program to run. It must be an executable file.

                       Optional. If omitted, the command file is executed as if it were a shell
                       script.
Arguments              Parameters to pass to the program.

                       Required only if the executable requires them.
Stdin                  Filename to use as standard input (stdin) by the program.

                       Optional. The default is /dev/null.
Stdout                 Filename to use as standard output (stdout) by the program.

                       Optional. The default is /dev/null.
Stderr                 Filename to use as standard error (stderr) by the program.

                       Optional. The default is /dev/null.
Cluster Input File A comma delimited local and remote path name pair, representing the
                   local file to copy to the remote location. If you have more than one pair
                   to enter, the More button will display a Cluster Input Files input
                   window.

                       Optional. The default is no files are copied.
Cluster Output         A comma delimited local and remote path name pair, representing the
File                   local file destination into which the remote file is copied. If you have
                       more than one pair to enter, the More button will display a Cluster
                       Output Files input window.

                       Optional. The default is no files are copied.
Initialdir             Initial directory. LoadLeveler changes to this directory before running
                       the job.

                       Optional. The default is your current working directory.
Notify User            User id of person to notify regarding status of submitted job.

                       Optional. The default is your userid.
StartDate              Month, day, and year in the format mm/dd/yyyy. The job will not start
                       before this date.

                       Optional. The default is to run the job as soon as possible.
StartTime              Hour, minute, second in the format hh:mm:ss. The job will not start
                       before this time.

                       Optional. The default is to run the job as soon as possible.

                       If you specify StartTime but not StartDate, the default StartDate is the
                       current day. If you specify StartDate but not StartTime, the default
                       StartTime is 00:00:00. This means that the job will start as soon as
                       possible on the specified date.



                        Priority            Number between 0 and 100, inclusive.

                                            Optional. The default is 50.

                                            This is the user priority. For more information on this priority, refer to
                                            “Setting and changing the priority of a job” on page 230.
                        Image size          Number in kilobytes that reflects the maximum size you expect your
                                            program to grow to as it runs.

                                            Optional.
                        Class               Class name. The job will only run on machines that support the
                                            specified class name. Your system administrator defines the class names.

                                            Optional:
                                            v Press the Choices button to get a list of available classes.
                                            v Press the Details button under the class list to obtain long listing
                                              information about classes.
                        Hold                Hold status of the submitted job. Permitted values are:
                                            user    User hold
                                            system System hold (only valid for LoadLeveler administrators)
                                            usersys User and system hold (only valid for LoadLeveler
                                                    administrators)

                                            Note: The default is a no-hold state.
                        Account Number      Number associated with the job. For use with the llacctmrg and
                                            llsummary commands for acquiring job accounting data.

                                            Optional. Required only if the ACCT keyword is set to A_VALIDATE in
                                            the configuration file.
                        Environment         Your initial environment variables when your job starts. Separate
                                            environment specifications with semicolons.

                                            Optional.
                        Copy                All or Master, to indicate whether the environment variables specified in
                        Environment         the keyword Environment are copied to all nodes or just to the master
                                            node of a parallel job.

                                            Optional.
                        Shell               The name of the shell to use for the job.

                                            Optional. If not specified, the shell used in the owner’s password file
                                            entry is used. If none is specified, /bin/sh is used.
                        Group               The LoadLeveler group name to which the job belongs.

                                            Optional.
                        Step Name           The name of this job step.

                                            Optional.




Node Usage         How the node is used. Permitted values are:
                   shared
                        The node can be shared with other tasks of other job steps. This is
                        the default.
                   not shared
                        The node cannot be shared.
                   slice not shared
                        Has the same meaning as not shared. It is provided for
                        compatibility.
Dependency         A Boolean expression defining the relationship between the job steps.

                   Optional.
Large Page         Whether or not the job step requires Large Page memory.
                   yes
                       Use Large Page memory if available, otherwise use regular memory.
                   mandatory
                       Use of Large Page memory is mandatory.
                   no Do not use Large Page memory.
Bulk Transfer      Indicates to the communication subsystem whether it should use the
                   bulk transfer mechanism to communicate between tasks.
                   yes
                       Use bulk transfer.
                   no Do not use bulk transfer.

                   Optional.
Rset               What type of RSet support is requested. Permitted values are:
                   rset_mcm_affinity
                        Requests scheduling affinity.
                        Use the MCM options button to specify task allocation method,
                        memory affinity preference or requirement, and adapter affinity
                        preference or requirement.
                   rset_name
                        Requests a user defined RSet and nodes with rset_support set to
                        rset_user_defined.

                   Optional.
Comments           Comments associated with the job. These comments help to distinguish
                   one job from another job.

                   Optional.
SMT                Indicates whether a job requires dynamic simultaneous multithreading
                   (SMT) function.
                   yes
                        The job requires SMT function.
                   no The job does not require SMT function.
                   as_is
                        The SMT state will not be changed.
Note: The fields that appear in this table are what you see when viewing the Build a Job
window. The text in these fields does not necessarily correspond with the keywords listed in
“Job command file keyword descriptions” on page 359.


        See “Job command file keyword descriptions” on page 359 for information
        on the defaults associated with these keywords.


SELECT
                              A Job Type if you want to change the job type.
                                 Your choices are:
                                 Serial Specifies a serial job. This is the default.
                                 Parallel
                                          Specifies a parallel job.
                                 Blue Gene
                                          Specifies a bluegene job.
                                 MPICH
                                          Specifies an MPICH job.
                                 Note that the job type you select affects the choices that are active on the
                                 Build A Job window.
                        SELECT
                              a Notification option.
                                 Your choices are:
                                 Always
                                        Notify you when the job starts, completes, and if it incurs errors.
                                 Complete
                                        Notify you when the job completes. This is the default option as
                                        initially defined in the skel.cmd file.
                                 Error Notify you if the job cannot run because of an error.
                                 Never Do not notify you.
                                 Start Notify you when the job starts.
                        SELECT
                              a Restart option.
                                 Your choices are:
                                 No       This job is not restartable. This is the default.
                                 Yes      Restart the job.
                        SELECT
                              To restart the job on the same nodes from which it was vacated.
                                 Your choices are:
                                 No       Restart the job on any available nodes.
                                 Yes      Restart the job on the same nodes it ran on previously. This option
                                          is valid after a job has been vacated.

                                 Note that there is no default for the selection.
                        SELECT
                              a Checkpoint option.
                                 Your choices are:
                                 No      Do not checkpoint the job. This is the default.
                                 Yes     Yes, checkpoint the job at intervals you determine. See the
                                         checkpoint keyword for more information.
                                 Interval
                                         Yes, checkpoint the job at intervals determined by LoadLeveler. See
                                         the checkpoint keyword for more information.
                        SELECT
                              To start from a checkpoint file
                                 Your choices are:

No      Do not start the job from a checkpoint file (start job from
                   beginning).
           Yes     Yes, restart the job from an existing checkpoint file when you
                   submit the job. The file name must be specified by the job
                   command file. The directory name may be specified by the job
                   command file, configuration file, or default location.
SELECT
      Coschedule if you want steps within a job to be scheduled and dispatched
      at the same time.
           Your choices are:
           No     Disables coscheduling for your job step.
           Yes    Allows coscheduling to occur for your job step.

                   Note:
                           1. This keyword is not inherited by other job steps.
                           2. The default is No.
                           3. The coscheduling function is only available with the
                              BACKFILL scheduler.
SELECT
      Nodes (available when the job type is parallel)
            The Nodes dialog box appears.
           Complete the necessary fields to specify node information for a parallel job
           (see Table 56). Depending upon which model you choose, different fields
           will be available; any unavailable fields will be desensitized. LoadLeveler
           will assign defaults for any fields that you leave blank. For more
           information, see the appropriate job command file keyword (listed in
           parentheses) in “Job command file keyword descriptions” on page 359.
Table 56. Nodes dialog box
Field               Available in:       Input
Min # of Nodes      Tasks Per Node      Minimum number of nodes required for running the
                    Model and Tasks     parallel job (node keyword).
                    with Uniform
                    Blocking Model      Optional. The default is one.
Max # of Nodes      Tasks Per Node      Maximum number of nodes required for running the
                    Model               parallel job (node keyword).

                                        Optional. The default is the minimum number of
                                        nodes.
Tasks per Node      Tasks Per Node      The number of tasks of the parallel job you want to
                    Model               run per node (tasks_per_node keyword).

                                        Optional.
Total Tasks         Tasks with          The total number of tasks of the parallel job you
                    Uniform Blocking    want to run on all available nodes (total_tasks
                    Model, and          keyword).
                    Custom Blocking
                    Model               Optional for Uniform, required for Custom Blocking.
                                        The default is one.
Blocking            Custom Blocking     The number of tasks assigned (as a block) to each
                    Model               consecutive node until all of a job’s tasks have been
                                        assigned (blocking keyword)




Task Geometry       Custom              The task ids of each task that you want to run on
                    Geometry Model      each node. You can use the “Set Geometry” button
                                        for step-by-step directions (task_geometry keyword).


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Network (available when the job type is parallel)
                                   The Network dialog box appears.
                                 The Network dialog box consists of two parts: The top half of the panel is
                                 for MPI, and the bottom half is for LAPI. Click on the check box to the left
                                 of MPI or LAPI to activate the part of the panel for which you want to
                                 specify network information. If you want to use MPI with LAPI, click on
                                 both:
                                 v The MPI check box.
                                 v The check box for Share windows between MPI and LAPI.
                                 Complete those fields for which you want to specify network information
                                 (see Table 57). For more information, see the network keyword description
                                 in “Job command file keyword descriptions” on page 359.
                        Table 57. Network dialog box fields
                        Field                    Input
                        MPI (MPI/LAPI)           Select:
                                                 v Only the MPI check box to use the Message Passing Interface
                                                   (MPI) protocol only.
                                                 v Both the MPI check box and the Share windows between MPI
                                                   and LAPI check box to use both MPI and the Low-level
                                                   Application Programming Interface (LAPI) protocols. This
                                                   selection corresponds to setting the network keyword in the job
                                                   command file to MPI_LAPI.

                                                 Optional.
                        LAPI                     Select the LAPI check box to use Low-level Application
                                                 Programming Interface (LAPI) protocol only.

                                                 Optional.
                        Adapter/Network          Select an adapter name or a network type from the list.

                                                 Required for each protocol you select.
                        Adapter Usage            Specifies that the adapter is either shared or not shared.

                                                 Optional. The default is shared.
                        Communication Mode Specifies the communication subsystem mode used by the
                                           communication protocol that you specify and can be either IP
                                           (Internet Protocol) or US (User Space).

                                                 Optional. The default is IP.
                        Communication Level      Specifies the amount of memory to be allocated to each window in
                                                 User Space mode. Allocation can be Low, Average, or High. It is
                                                 ignored by Switch_Network_Interface_For_HPS adapters.



Instances                Specifies the number of windows or IP addresses the
                         communication subsystem should allocate to this protocol.

                         Optional. The default is 1 unless sn_all is specified for the network
                         keyword, in which case the default is max.
rCxt Blocks              The number of user rCxt blocks requested for each window used by
                         the associated protocol. It is recognized only by
                         Switch_Network_Interface_For_HPS adapters.

                         Optional.
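
These dialog box selections correspond to the network keyword in the job
command file. The following sketch requests both MPI and LAPI with shared
adapter windows in User Space mode (the values are illustrative; see the
network keyword description on page 359 for the exact syntax):

   # @ network.MPI_LAPI = sn_all,shared,US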


SELECT
      Close to return to the Build a Job dialog box.
SELECT
      Requirements
            The Requirements dialog box appears.
           Complete those fields for which you want to specify requirements (see
           Table 58). Defaults are used for those fields that you leave blank.
           LoadLeveler dispatches your job only to machines whose resources match
           the requirements you specify.
Table 58. Requirements dialog box fields
Field               Input
Architecture        Machine type. The job will not run on any other machine type.

(see note 2)        Optional. The default is the architecture of your current machine.
Operating System Operating system. The job will not run on any other operating system.

(see note 2)        Optional. The default is the operating system of your current machine.
Disk                Amount of disk space in the execute directory. The job will only run on
                    a machine with at least this much disk space.

                    Optional. The default is defined in your local configuration file.
Memory              Amount of memory. The job will only run on a machine with at least
                    this much memory.

                    Optional. The default is defined in your local configuration file.
Large Page          Amount of Large Page memory, in megabytes. The job step requires at
Memory              least this much Large Page memory to run.

                    Optional.
Total Memory        Amount of total (regular and Large Page memory) in megabytes needed
                    to run the job step.

                    Optional.
Machines            Machine names. The job will only run on the specified machines.

                    Optional.
Features            Features. The job will only run on machines with specified features.

                    Optional.



                        Pool                Specifies the number associated with the pool you want to use. All
                                            available pools listed in the administration file appear as choices. The
                                            default is to select nodes from any pool.
                        LoadLeveler         Specifies the version of LoadLeveler, in dotted decimal format, on the
                        Version             machine where you want the job to run. For example: 3.3.0.0 specifies
                                            that your job will run on a machine running LoadLeveler Version 3.3.0.0
                                            or higher.

                                            Optional.
                        Connectivity        A number from 0.0 through 1.0, representing the average connectedness
                                            of the node’s managed adapters.
                        Requirement         Requirements. The job will only run if these requirements are met.
                        Note:
                        1. If you enter a resource that is not available, you will NOT receive a message.
                           LoadLeveler holds your job in the Idle state until the resource becomes available.
                           Therefore, make certain that the spelling of your entry is correct. You can issue llq -s
                           jobID to find out if you have a job for which requirements were not met.
                        2. If you do not specify an architecture or operating system, LoadLeveler assumes that
                           your job can run only on your machine’s architecture and operating system. If your job
                           is not a shell script that can be run successfully on any platform, you should specify a
                           required architecture and operating system.
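
                        For reference, the Requirements fields correspond to a requirements
                        statement in the job command file, such as the following sketch (the
                        architecture and operating system values are illustrative):

                           # @ requirements = (Arch == "R6000") && (OpSys == "AIX53") && (Memory >= 512)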


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Resources
                                   The Resources dialog box appears.
                                 This dialog box allows you to set the amount of defined consumable
                                 resources required for a job step. Resources with an asterisk (*) appended to their
                                 names are not in the SCHEDULE_BY_RESOURCES list. For more
                                 information, see the resources keyword.
                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Preferences
                                   The Preferences dialog box appears.
                                 This dialog box is similar to the Requirements dialog box, with the
                                 exception of the Adapter choice, which is not supported as a Preference.
                                 Complete the fields for those parameters that you want to specify. These
                                 parameters are not binding. For any preferences that you specify,
                                 LoadLeveler attempts to find a machine that matches these preferences
                                 along with your requirements. If it cannot find the machine, LoadLeveler
                                 chooses the first machine that matches the requirements.
                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Limits


The Limits dialog box appears.
         Complete the fields for those limits that you want to impose upon your job
         (see Table 59). If you type copy in any field except wall_clock_limit or
         job_cpu_limit, the limits in effect on the submit machine are used. If you
         leave any field blank, the default limits in effect for your userid on the
         machine that runs the job are used. For more information, see “Using limit
         keywords” on page 89.
Table 59. Limits dialog box fields
Field               Input
CPU Limit           Maximum amount of CPU time that the submitted job can use. Express
                    the amount as:
                    [[hours:]minutes:]seconds[.fraction]

                    For example, 12:56:21 is 12 hours, 56 minutes, and 21 seconds.

                    Optional
Data Limit          Maximum amount of the data segment that the submitted job can use.
                    Express the amount as:
                    integer[.fraction][units]

                    Optional
Core Limit          Maximum size of a core file.

                    Optional
RSS Limit           Maximum size of the resident set size. It is the largest amount of
                    physical memory a user’s process can allocate.

                    Optional
File Limit          Maximum size of a file that is created.

                    Optional
Stack Limit         Maximum size of the stack.

                    Optional
Job CPU Limit
                    Maximum total CPU time that can be used by all processes of a serial
                    job step. For a parallel job, this is the total CPU time for each
                    LoadL_starter process and its descendants, for each job step.

                    Optional
Wall Clock Limit    Maximum amount of elapsed time for which a job can run.

                    Optional
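
For reference, these fields correspond to limit keywords in the job command
file. A sketch (the values are illustrative):

   # @ cpu_limit = 12:56:21
   # @ job_cpu_limit = 10:00:00
   # @ wall_clock_limit = 24:00:00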


SELECT
      Close to return to the Build a Job dialog box.
SELECT
      Checkpointing to specify checkpoint options (available when the
      checkpoint option is set to Yes or Interval)
             The checkpointing dialog box appears.
         Complete those fields for which you want to specify checkpoint
         information (see Table 60 on page 248). For detailed information on specific
         keywords, see “Job command file keyword descriptions” on page 359.

Table 60. Checkpointing dialog box fields
                        Field               Input
                        Ckpt File           Specifies a checkpoint file. The serial default is:
                                            $(job_name).$(host).$(domain).$(jobid).$(stepid).ckpt
                        Ckpt Directory      Specifies a checkpoint directory name.
                        Ckpt Execute        Specifies a directory to use for staging the checkpoint executable file.
                        Directory
                        Ckpt Time Limits    Sets the limits for the elapsed time a job can take checkpointing.
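
                        For reference, these fields correspond to the checkpointing keywords in
                        the job command file. A sketch (the file and directory names are
                        hypothetical):

                           # @ checkpoint = interval
                           # @ ckpt_file = myjob.ckpt
                           # @ ckpt_dir = /scratch/ckpt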


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Blue Gene (available when the job type is bluegene)
                                    The Blue Gene window appears.
                                 Complete the necessary fields to specify information for a Blue Gene job
                                 (see Table 61). Depending upon which request type you choose, different
                                 fields will be available; unavailable fields are disabled (grayed out). For
                                 more information, see the appropriate job command file keyword (listed in
                                 parentheses) in “Job command file keyword descriptions” on page 359.
                        Table 61. Blue Gene job fields
                        Field               Available when      Input
                                            requesting by:
                        # of Compute        Size                The requested size, in number of compute nodes, of
                        Nodes                                   the partition for this Blue Gene job. (bg_size)
                        Shape               Shape               The requested shape of the partition for this Blue Gene job.
                                                                The units of each dimension of the shape are in
                                                                number of base partitions, XxYxZ, where X, Y, and Z
                                                                are the number of base partitions in the X-direction,
                                                                Y-direction, and Z-direction. (bg_shape)
                        Partition Name      Partition           The name of an existing partition in the Blue Gene
                                                                system where the requested job should run.
                                                                (bg_partition)
                        Connection Type     Size and Shape      The kinds of Blue Gene partitions that can be selected
                                                                for this job. You can select Torus, Mesh, or Prefer
                                                                Torus. (bg_connection)

                                                                Optional. The default is Mesh.
                        Rotate              Shape               Whether to consider all possible rotations of the
                        Dimensions                              specified shape (True) or only the specified shape
                                                                (False) when assigning a partition for the Blue Gene
                                                                job. (bg_rotate)

                                                                Optional. The default is True.




              Memory             Megabytes           A number (in megabytes) that represents the
                                                     minimum available virtual memory that is needed to
                                                     run the job. LoadLeveler generates a Blue Gene
                                                     requirement that specifies memory that is greater
                                                     than or equal to the amount you specify.

                                                     Optional. If you leave this field blank, this parameter
                                                     is not used when searching for machines to run your
                                                     job.
              Requirements       Expression          An expression that specifies the Blue Gene
                                                     requirements that a machine must meet in order to
                                                     run the job.

                                                     Memory is the supported keyword.
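
              For reference, these fields correspond to the Blue Gene keywords in the job
              command file. A sketch of a request by size (the values are illustrative):

                 # @ job_type = bluegene
                 # @ bg_size = 512
                 # @ bg_connection = torus
                 # @ bg_rotate = true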


              SELECT
                    Close to return to the Build a Job dialog box.

Editing the job command file
              Use these instructions to edit the job command file that you just built.

              There are several ways to do this:
              1. Using the Jobs window:
                   SELECT
                         File → Submit a Job
                             The Submit a Job dialog box appears.
                   SELECT
                         The job file you want to edit from the file column.
                   SELECT
                         Edit
                            Your job command file appears in a window. You can use any editor
                          to edit the job command file. The default editor is specified in your
                          .Xdefaults file.
                         If you have an icon manager, an icon may appear. An icon manager is a
                         program that creates a graphic symbol, displayed on a screen, that you
                         can point to with a device such as a mouse in order to select a
                         particular function or application. Select this icon to view your job
                         command file.
              2. Using the Tools Edit pull-down menus on the Build a Job window:
                 Using the Edit pull-down menu, you can modify the job command file. Your
                  choices appear in Table 62:
              Table 62. Modifying the job command file with the Edit pull-down menu
              To                                                              Select
              Add a step to the job command file                              Add a Step or Add a First
                                                                              Step
              Delete a step from the job command file                         Delete a Step


                        Clear the fields in the Build a Job window                           Clear Fields
                        Select defaults to use in the fields                                 Set Field Defaults
                        Note: Other options include Go to Next Step, Go to Previous Step, and Go to Last Step,
                        which allow you to edit various steps in the job command file.

                             Using the Tools pull-down menu, you can modify the job command file. Your
                             choices appear in Table 63:
                        Table 63. Modifying the job command file with the Tools pull-down menu
                        To                                                                   Select
                        Name the job                                                         Set Job Name
                        Specify a cluster, cluster list, or any cluster, if a multicluster   Set Cluster
                        environment is configured.
                        Open a window where you can enter a script file                      Append Script
                        Fill in the fields using another file                                Restore from File
                        View the job command file in a window                                View Entire Job
                        Determine which step you are viewing                                 What is step #
                        Start a new job command file                                         Start a new job

                             You can save and submit the information you entered by selecting the choices
                             shown in Table 64:
                        Table 64. Saving and submitting information
                        To                                 Do This
                        Save the information you entered   SELECT
                        into a file which you can submit         Save
                        later                                        A window appears prompting you to
                                                                 enter a job filename.
                                                           ENTER
                                                                 a job filename in the text entry field.
                                                           SELECT
                                                                 OK
                                                                     The window closes and the information
                                                                 you entered is saved in the file you
                                                                 specified.
                        Submit the program immediately     SELECT
                        and discard the information you          Submit
                        entered



Submitting a job command file
                        After building a job command file, you can submit it to one or more machines for
                        processing.
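
                        If you prefer the command line, the equivalent operation is the llsubmit
                        command (the file name is illustrative):

                           llsubmit myjob.cmd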

                        To submit a job, from the Jobs window:
                        SELECT
                              File → Submit a Job


The Submit a Job dialog box appears.
              SELECT
                    The job file that you want to submit from the file column.
                      You can also use the filter field and the directories column to select the file
                      or you can type in the file name in the text entry field.
              SELECT
                    Submit
                       The job is submitted for processing.
                      You can now submit another job or you can press Close to exit the
                      window.

Displaying and refreshing job status
              When you submit a job, the status of the job is automatically displayed in the Jobs
              window.

              You can update or refresh this status using the Jobs window and selecting one of
              the following:
              v Refresh → Refresh Jobs
              v Refresh → Refresh All.

               To change the amount of time that passes before the Jobs window is
               automatically refreshed, use the Jobs window.
              SELECT
                    Refresh → Set Auto Refresh
                       A window appears.
              TYPE IN
                    a value for the number of seconds to pass before the Jobs window is
                    updated.
                      Automatic refresh can be expensive in terms of network usage and CPU
                      cycles. You should specify a refresh interval of 120 seconds or more for
                      normal use.
              SELECT
                    OK
                       The window closes and the value you specified takes effect.

              To receive detailed information on a job:
              SELECT
                    Actions → Extended Status to receive additional information on the job.
                     Selecting this option is the same as typing the llq -x command.
                      You can also get information in the following way:
              SELECT
                    Actions → Extended Details
                       Selecting this option is the same as typing the llq -x -l command. You can also
                      double click on the job in the Jobs window to get details on the job.
                      Note: Obtaining extended status or details on multiple jobs can be
                      expensive in terms of network usage and CPU cycles.

SELECT
                              Actions → Job Status
                                 You can also use the llq -s command to determine why a submitted job
                                 remains in the Idle or Deferred state.
                        SELECT
                              Actions → Resource Use
                                 Allows you to display resource use for running jobs. Selecting this option
                                 is the same as entering the llq -w command.
                        SELECT
                              Actions → Blue Gene Job Status
                                 Allows you to display Blue Gene job information for jobs. Selecting this
                                 option is the same as entering the llq -b command.

                        For more information on requests for job information, see “llq - Query job status”
                        on page 479.
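
                        For example, to find out from the command line why a submitted job
                        remains in the Idle state, you can issue (the job step ID is illustrative):

                           llq -s c94n01.1234.0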

Sorting the Jobs window
                        You can specify up to two sorting options for the Jobs window.

                        The options you specify determine the order in which the jobs appear in the Jobs
                        window.

                        From the Jobs window:
                        SELECT
                              Sort → Set Sort Parameters
                                  A window appears.
                        SELECT
                              A primary and secondary sort.

                        Table 65 lists the sorting options:
                        Table 65. Sorting the jobs window
                        To:                                                             Select Sort
                        Sort jobs by the machine from which they were                   Sort by Submitting Machine
                        submitted
                        Sort by owner                                                   Sort by Owner
                        Sort by the time the jobs were submitted                        Sort by Submission Time
                        Sort by the state of the job                                    Sort by State
                        Sort jobs by their user priority (last job listed runs first)   Sort by Priority
                        Sort by the class of the job                                    Sort by Class
                        Sort by the group associated with the job                       Sort by Group
                        Sort by the machine running the job                             Sort by Running Machine
                        Sort by dispatch order                                          Sort by Dispatch Order
                        Not specify a sort                                              No Sort


                        You can select a sort type as either a Primary or Secondary sorting option. For
                        example, suppose you select Sort by Owner as the primary sorting option and Sort
                        by Class as the secondary sorting option. The Jobs window is sorted by owner
                        and, within each owner, by class.

Changing the priority of your jobs
               If your job has not yet begun to run and is still in the queue, you can change the
               priority of the job in relation to your other jobs in the queue that belong to the
               same class.

               This only affects the user priority of the job. For more information on this priority,
               refer to “Setting and changing the priority of a job” on page 230. Only the owner
               of a job or the LoadLeveler administrator can change the priority of a job.

               From the Jobs window:
               SELECT
                     a job by clicking on it with the mouse
               SELECT
                     Actions → Priority
                         A window appears.
               TYPE IN
                     a number between 0 and 100, inclusive, to indicate a new priority.
               SELECT
                     OK
                         The window closes and the priority of your job changes.
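
               From the command line, the llprio command changes the user priority of a
               job step. A sketch, assuming the increment form of the command (the step
               ID is illustrative):

                  llprio +10 c94n01.1234.0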

Placing a job on hold
               Only the owner of a job or the LoadLeveler administrator can place a hold on a
               job.

               From the Jobs window:
               SELECT
                     The job you want to hold by clicking on it with the mouse
               SELECT
                     Actions → Hold
                         The job is put on hold and its status changes in the Jobs window.

Releasing the hold on a job
               Only the owner of a job or the LoadLeveler administrator can release a hold on a
               job.

               From the Jobs window:
               SELECT
                     The job you want to release by clicking on it with the mouse
               SELECT
                     Actions → Release from Hold
                        The job is released from hold and its status is updated in the Jobs
                       window.
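
               From the command line, the llhold command places a hold on a job step,
               and llhold -r releases it (the step ID is illustrative):

                  llhold c94n01.1234.0
                  llhold -r c94n01.1234.0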




Canceling a job
                        Only the owner of a job or the LoadLeveler administrator can cancel a job.

                        From the Jobs window:
                        SELECT
                              The job you want to cancel by clicking on it with the mouse
                        SELECT
                              Actions → Cancel
                                   LoadLeveler cancels the job and the job information disappears from the
                                 Jobs window.
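
                        From the command line, the equivalent operation is the llcancel command
                        (the step ID is illustrative):

                           llcancel c94n01.1234.0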

Modifying consumable resources and other job attributes
                        Use these commands to modify the consumable CPUs or memory requirements,
                        class, account number, or Blue Gene attributes of a nonrunning job.
                        SELECT

                                 Modify    → Consumable CPUs
                                 or
                                 Modify    → Consumable Memory
                                 or
                                 Modify    → Class
                                 or
                                 Modify    → Account number
                                 or
                                 Modify    → Blue Gene → Connection
                                 or
                                 Modify    → Blue Gene → Partition
                                 or
                                 Modify    → Blue Gene → Rotate
                                 or
                                 Modify    → Blue Gene → Shape
                                 or
                                 Modify    → Blue Gene → Size
                                 or
                                 Modify    → Blue Gene → Requirement

                                   A dialog box appears prompting you to enter a new value for the
                                 selected job attribute. Blue Gene attributes are available when Blue Gene is
                                 enabled.
                        TYPE IN
                              The new value
                        SELECT
                              OK
                                   The dialog box closes and the value you specified takes effect.

Taking a checkpoint
                        Use these commands to checkpoint the selected job.




SELECT
                     One of the following actions to take when checkpoint has completed:
                     v Continue the step
                     v Terminate the step
                     v Hold the step
                         A checkpoint monitor for this step appears.
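
                      From the command line, a checkpoint of a running job step can be taken
                      with the llckpt command. A minimal sketch (the step ID is illustrative;
                      see the command reference for the options that select the action taken
                      when the checkpoint completes):

                         llckpt c94n01.1234.0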

Adding a job to a reservation
               Use these commands to bind selected job steps to a reservation so that they will
               only be scheduled to run on the nodes reserved for the reservation.
               SELECT
                     The job you want to bind by clicking on it with the mouse.
               SELECT
                     Actions → Bind to Reservation
                         A window appears.
               SELECT
                     A reservation from the list.
               SELECT
                     OK
                         The window closes and the job is bound to that reservation.
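
               From the command line, job steps can be bound to a reservation with the
               llbind command. A sketch, assuming the -R option selects the reservation
               (the reservation and step IDs are illustrative):

                  llbind -R c94n01.10.r c94n01.1234.0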

Removing a job from a reservation
               Use these commands to unbind selected job steps from reservations to which they
               currently belong.
               SELECT
                     The job you want to unbind by clicking on it with the mouse.
               SELECT
                     Actions → Unbind from Reservation

               If the job is bound to a reservation, it is removed from the reservation.

Displaying and refreshing machine status
               The status of the machines is automatically displayed in the Machines window.

               You can update or refresh this status using the Machines window and selecting
               one of the following:
               v Refresh → Refresh Machines
               v Refresh → Refresh All.

               To specify an amount of time to pass before the Machines window is automatically
               refreshed, from the Machines window:
               SELECT
                     Refresh → Set Auto Refresh
                         A window appears.




TYPE IN
                              a value for the number of seconds to pass before the Machines window is
                              updated.
                                 Automatic refresh can be expensive in terms of network usage and CPU
                                 cycles. You should specify a refresh interval of 120 seconds or more for
                                 normal use.
                        SELECT
                              OK
                                   The window closes and the value you specified takes effect.

                        To receive detailed information on a machine:
                        SELECT
                              Actions → Details
                                 This displays status information about the selected machines. Selecting this
                                  option has the same effect as typing the llstatus -l command.
                        SELECT
                              Actions → Adapter Details
                                 This displays virtual and physical adapter information for each selected
                                 machine. Selecting this option has the same effect as typing the llstatus -a
                                  command.
                        SELECT
                              Actions → Floating Resources
                                 This displays consumable resources for the LoadLeveler cluster. Selecting
                                  this option has the same effect as typing the llstatus -R command.
                        SELECT
                              Actions → Machine Resources
                                 This displays consumable resources defined for the selected machines or all
                                 machines. Selecting this option has the same effect as typing the llstatus -R
                                  command.
                        SELECT
                              Actions → Cluster Status
                                 This displays status of machines in the defined cluster or clusters. It
                                 appears only when a multicluster environment is configured and is
                                 equivalent to the llstatus -X all command.
                        SELECT
                              Actions → Cluster Config
                                 This displays cluster information from the LoadL_admin file. Only fields
                                 with data specified or which have defaults when not specified are
                                 displayed. It appears only when a multicluster environment is configured
                                 and is equivalent to the llstatus -C command.
                        SELECT
                              Actions → Blue Gene ...
                                 This displays information about the Blue Gene system. You can select the
                                 option for Status for a short listing, Details for a long listing, Base
                                 Partitions for Blue Gene base partition status, or Partitions for existing




Blue Gene partition status. It is available only when Blue Gene support is
                        enabled in LoadLeveler. This is equivalent to the llstatus command with
                        the options -b, -b -l, -B, or -P.

Sorting the Machines window
               You can specify up to two sorting options for the Machines window.

               The options you specify determine the order in which machines appear in the
               window.

               From the Machines window:
               SELECT
                     Sort → Set Sort Parameters
                         A window appears.
               SELECT
                     A primary and secondary sort.

               Table 66 lists sorting options for the Machines window:
               Table 66. Sorting the machines window
               To:                                                               Select Sort →
               Sort by machine name                                              Sort by Name
               Sort by Schedd state                                              Sort by Schedd
               Sort by total number of jobs scheduled                            Sort by InQ
               Sort by number of running jobs scheduled by this machine          Sort by Act
               Sort by startd state                                              Sort by Startd
               Sort by the number of jobs running on this machine                Sort by Run
               Sort by load average                                              Sort by LdAvg
               Sort by keyboard idle time                                        Sort by Idle
               Sort by hardware architecture                                     Sort by Arch
               Sort by operating system type                                     Sort by OpSys
               Not specify a sort                                                No Sort


               You can select a sort type as either a Primary or Secondary sorting option. For
               example, suppose you select Sort by Arch as the primary sorting option and Sort
               by Name as the secondary sorting option. The Machines window is sorted by
               hardware architecture, and within each architecture type, by machine name.

Finding the location of the central manager
               The LoadLeveler administrator designates one of the nodes in the LoadLeveler
               cluster as the central manager.

               When jobs are submitted at any node, the central manager is notified and decides
               where to schedule the jobs. In addition, it keeps track of the status of machines in
               the cluster and the jobs in the system by communicating with each node.
               LoadLeveler uses this information to make the scheduling decisions and to
               respond to queries.

               To find the location of the central manager, from the Machines window:


SELECT
                              Actions → Find Central Manager
                                   A message appears in the message window declaring on which machine
                                 the central manager is located.

Finding the location of the public scheduling machines
                        Public scheduling machines are those machines that participate in the scheduling
                        of LoadLeveler jobs on behalf of the submit-only machines.

                        To get a list of these machines in your cluster, use the Machines window:
                        SELECT
                              Actions → Find Public Scheduler
                                   A message appears displaying the names of these machines.

Finding the type of scheduler in use
                        The LoadLeveler administrator defines the scheduler used by the cluster.

                        To determine which scheduler is currently in use:
                        SELECT
                              Actions → Find Scheduler Type
                                   A message appears displaying the type:
                                 v ll_default
                                 v BACKFILL
                                 v External (API)

Specifying which jobs appear in the Jobs window
                        Normally, only your jobs appear in the Jobs window.

                        You can, however, specify which jobs you want to appear by using the Select
                        pull-down menu on the Jobs window (see Table 67).
                        Table 67. Specifying which jobs appear in the Jobs window
                        To Display                         Select Select →
                        All jobs in the queue              All
                        All jobs belonging to a specific   By User
                        user (or users)
                                                            A window appears prompting you to enter the user IDs
                                                           whose jobs you want to view.
                        All jobs submitted to a specific By Machine
                        machine (or machines)
                                                          A window appears prompting you to enter the machine
                                                         names on which the jobs you want to view are running.
                        All jobs belonging to a specific   By Group
                        group (or groups)
                                                             A window appears prompting you to enter the
                                                           LoadLeveler group names to which the jobs you want to
                                                           view belong.




              All jobs having a particular ID   By Job Id

                                                A dialog box prompts you to enter the id of the job you
                                                want to appear. This ID appears in the left column of the
                                                Jobs window. Type in the ID and press OK.
               Note: When you choose By User, By Machine, or By Group, you can use a UNIX regular
               expression enclosed in parentheses. For example, you can enter (^k10) to display all
               machines beginning with the characters “k10”.


              SELECT
                    Select → Show Selection to show the selection parameters.

Specifying which machines appear in Machines window
              You can specify which machines will appear in the Machines window.

              See Table 68. The default is to view all of the machines in the LoadLeveler pool.

              From the Machines window:
              Table 68. Specifying which machines appear in Machines window
              To                                  Select Select →
              View all of the machines            All
              View machines by operating          by OpSys
              system
                                                   A window appears prompting you to enter the
                                                  operating system of those machines you want to view.
              View machines by hardware           by Arch
              architecture
                                                    A window appears prompting you to enter the
                                                  hardware architecture of those machines you want to
                                                  view.
              View machines by state              by State

                                                    A cascading pull-down menu appears prompting you
                                                  to select the state of the machines that you want to view.


              SELECT
                    Select → Show Selection to show the selection parameters.

Saving LoadLeveler messages in a file
              Normally, all the messages that LoadLeveler generates appear in the Messages
              window.

              If you would also like to have these messages written to a file, use the Messages
              window.
              SELECT
                    Actions → Start logging to a file
                        A window appears prompting you to enter a filename in which to log
                      the messages.


TYPE IN
                              The filename in the text entry field.
                        SELECT
                              OK
                                   The window closes.




Part 4. TWS LoadLeveler interfaces reference
            The topics in the TWS LoadLeveler interfaces reference provide the details you
            need to know to correctly use the IBM Tivoli Workload Scheduler (TWS)
            LoadLeveler interfaces for the following tasks:
            v Specifying keywords in the TWS LoadLeveler control files
            v Starting and customizing the TWS LoadLeveler GUI
            v Correctly coding the TWS LoadLeveler commands and APIs




Chapter 12. Configuration file reference
               The configuration file contains many parameters that you can set or modify to
               control how LoadLeveler operates.

               You may control LoadLeveler’s operation either:
               v Across the cluster, by modifying the global configuration file, LoadL_config, or
               v Locally, by modifying the LoadL_config.local file on individual machines.

               Table 69 shows the configuration subtasks:
               Table 69. Configuration subtasks
               Subtask                              Associated information (see . . . )
               To find out what administrator tasks Chapter 4, “Configuring the LoadLeveler
               you can accomplish by using the      environment,” on page 41
               configuration file
               To learn how to correctly specify the v “Configuration file syntax”
               contents of a configuration file
                                                     v “Configuration file keyword descriptions” on page
                                                       265
                                                    v “User-defined keywords” on page 313
                                                    v “LoadLeveler variables” on page 314



Configuration file syntax
               The information in both the LoadL_config and the LoadL_config.local files is in
               the form of statements. These statements are made up of keywords and values.

               There are three types of configuration file keywords:
               v Keywords, described in “Configuration file keyword descriptions” on page 265.
               v User-defined variables, described in “User-defined keywords” on page 313.
               v LoadLeveler variables, described in “LoadLeveler variables” on page 314.

               Configuration file statements take one of the following formats:
               keyword=value
               keyword:value

               Statements in the form keyword=value are used primarily to customize an
               environment. Statements in the form keyword:value are used by LoadLeveler to
               characterize the machine and are known as part of the machine description. Every
               machine in LoadLeveler has its own machine description which is read by the
               central manager when LoadLeveler is started.
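
               For example, the first statement below uses the keyword=value form and the
               second uses the keyword:value form (the values are illustrative):

                  ADMIN_FILE = /u/loadl/admin_file
                  MACHPRIO : (Memory + FreeRealMemory)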

               Keywords are not case sensitive. This means you can enter them in lower case,
               upper case, or mixed case.

               Note: For the keyword=value form, if the keyword is of a boolean type and only
                     true and false are valid input, a value string starting with t or T is taken as
                     true; all other values are taken as false.

               To continue configuration file statements, use the backslash character (\).
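
               For example, the following two physical lines form a single statement:

                  floating_resources = spice2g6(9876543210123) \
                                       db2_license(1234567890)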

In the configuration file, comments must be on a separate line from keyword
                            statements.

                            You can use the following types of constants and operators in the configuration
                            file.

                 Numerical and alphabetical constants
                            These are the numerical and alphabetical constants.

                            Constants may be represented as:
                            v Boolean expressions
                            v Signed integers
                            v Floating point values
                            v Strings enclosed in double quotes (" ").

                 Mathematical operators
                            You can use the following C operators.

                            The operators are listed in order of precedence. All of these operators are evaluated
                            from left to right:
                            v !
                            v * /
                            v - +
                            v < <= > >=
                            v == !=
                            v &&
                            v ||
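
                            Because the comparison operators take precedence over &&, the following
                            two forms of a START expression are equivalent (the expression itself is
                            illustrative):

                               START : (LoadAvg <= 0.5) && (KeyboardIdle > 300)
                               START : LoadAvg <= 0.5 && KeyboardIdle > 300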

                 64-bit support for configuration file keywords and expressions
                            Administrators can assign 64-bit integer values to selected keywords in the
                            configuration file.
                            floating_resources
                                Consumable resources associated with the floating_resources keyword may be
                                assigned 64-bit integer values. Fractional and unit specifications are not
|                               allowed. The predefined ConsumableCpus, ConsumableMemory,
|                               ConsumableLargePageMemory, and ConsumableVirtualMemory may not be
|                               specified as floating resources.
                                Example:
                                floating_resources = spice2g6(9876543210123) db2_license(1234567890)
                            MACHPRIO expression
|                             The LoadLeveler variables: Disk, ConsumableCpus, ConsumableMemory,
|                             ConsumableVirtualMemory, ConsumableLargePageMemory, PagesScanned,
|                             Memory, VirtualMemory, FreeRealMemory, and PagesFreed may be used in a
|                             MACHPRIO expression. They are 64-bit integers and 64-bit arithmetic is used
                              to evaluate them.
                                Example:
                                MACHPRIO: (Memory + FreeRealMemory) - (LoadAvg*1000 + PagesScanned)




Configuration file keyword descriptions
              This topic provides an alphabetical list of the keywords you can use in a
              LoadLeveler configuration file.

              It also provides examples of statements that use these keywords.
              ACCT
                Turns the accounting function on or off.
                 Syntax:
                 ACCT = flag ...

                 The available flags are:
                 A_DETAIL
                      Enables extended accounting. Using this flag causes LoadLeveler to
                       record detailed resource consumption, by machine and by event, for each
                      job step. This flag also enables the -x flag of the llq command,
                      permitting users to view resource consumption for active jobs.
                 A_RES
                           Turns reservation data recording on.
                 A_OFF
                           Turns accounting data recording off.
                 A_ON Turns accounting data recording on. If specified without the
                      A_DETAIL flag, the following is recorded:
                      v The total amount of CPU time consumed by the entire job
                      v The maximum memory consumption of all tasks (or nodes).
                 A_VALIDATE
                       Turns account validation on.
                 Default value: A_OFF
                  Example: This example specifies that accounting should be turned on, that
                  extended accounting data should be collected, and that the -x flag of the llq
                  command should be enabled.
                 ACCT = A_ON A_DETAIL
              ACCT_VALIDATION
                Identifies the executable called to perform account validation.
                 Syntax:
                 ACCT_VALIDATION = program

                 Where program is a validation program.
                  Default value: $(BIN)/llacctval (the accounting validation program shipped
                  with LoadLeveler).
              ACTION_ON_MAX_REJECT
                Specifies the state in which jobs are placed when their rejection count has
                reached the value of the MAX_JOB_REJECT keyword. HOLD specifies that
                jobs are placed in User Hold status; SYSHOLD specifies that jobs are placed in
                System Hold status; CANCEL specifies that jobs are canceled. When a job is
                rejected, LoadLeveler sends a mail message stating why the job was rejected.
                 Syntax:
                 ACTION_ON_MAX_REJECT = HOLD | SYSHOLD | CANCEL

Default value: HOLD
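                  Example: The following hypothetical settings cancel any job whose rejection
                  count reaches five (MAX_JOB_REJECT is shown for context):
                  MAX_JOB_REJECT = 5
                  ACTION_ON_MAX_REJECT = CANCEL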
                            ACTION_ON_SWITCH_TABLE_ERROR
                              Points to an administrator supplied program that will be run when
                              DRAIN_ON_SWITCH_TABLE_ERROR is set to true and a switch table
                              unload error occurs.
                                Syntax:
                                ACTION_ON_SWITCH_TABLE_ERROR = program
                                Default value: The default is to not run a program.
                            ADMIN_FILE
                              Points to the administration file containing user, class, group, machine, and
                              adapter stanzas.
                                Syntax:
                                 ADMIN_FILE = filename
                                Default value: $(tilde)/admin_file
                            AFS_GETNEWTOKEN
                               Specifies a filter that, for example, can be used to refresh an AFS token.
                                Syntax:
                                AFS_GETNEWTOKEN = full_path_to_executable

                                Where full_path_to_executable is an administrator-supplied program that
                                receives the AFS authentication information on standard input and writes the
                                new information to standard output. The filter is run when the job is
                                scheduled to run and can be used to refresh a token which expired when the
                                job was queued.
                                Default value: The default is to not run a program.
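                                 Example: The following statement names a hypothetical
                                 administrator-supplied token-refresh program:
                                 AFS_GETNEWTOKEN = /usr/local/bin/refresh_afs_token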
                            AGGREGATE_ADAPTERS
                              Allows an external scheduler to specify per-window adapter usages.
                                Syntax:
                                AGGREGATE_ADAPTERS = YES | NO
                                When this keyword is set to YES, the resources from multiple switch adapters
                                on the same switch network are treated as one aggregate pool available to each
                                job. When this keyword is set to NO, the switch adapters are treated
                                individually and a job cannot use resources from multiple adapters on the
                                same network.
                                Set this keyword to NO when you are using an external scheduler; otherwise,
                                set to YES (or accept the default).
                                Default value: YES
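                                 Example: The following setting is appropriate when an external
                                 scheduler is in use:
                                 AGGREGATE_ADAPTERS = NO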
|                           ALLOC_EXCLUSIVE_CPU_PER_JOB
|                              Specifies the way CPU affinity is enforced on Linux platforms. When this
|                              keyword is not specified or when an unrecognized value is assigned to it,
|                              LoadLeveler will not attempt to set CPU affinity for any application processes
|                              spawned by it.

|                               Note: This keyword is valid only on Linux x86 and x86_64 platforms. This
|                                     keyword is ignored by LoadLeveler on all other platforms.
|                               The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the
|                               global or local configuration files. It can also be specified in both configuration

|      files, in which case the setting in the local configuration file will override that
|      of the global configuration file. The keyword cannot be turned off in a local
|      configuration file if it has been set to any value in the global configuration file.
|      Changes to ALLOC_EXCLUSIVE_CPU_PER_JOB will not take effect at
|      reconfiguration. The administrator must stop and restart or recycle
|      LoadLeveler when changing ALLOC_EXCLUSIVE_CPU_PER_JOB.
|      Syntax:
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL|PHYSICAL
|      Default value: By default, when this keyword is not specified, CPU affinity is
|      not set.
|      Example: When the value of this keyword is set to LOGICAL, only one
|      LoadLeveler job step will run on each of the processors available on the
|      machine:
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL
|      Example: When the value of this keyword is set to PHYSICAL, all logical
|      processors (or physical cores) configured in one physical CPU package will be
|      allocated to one and only one LoadLeveler job step.
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL
    ARCH
      Indicates the standard architecture of the system. The architecture you specify
      here must be specified in the same format in the requirements and preferences
      statements in job command files. The administrator defines the character string
      for each architecture.
       Syntax:
       ARCH = string
       Default value: Use the command llstatus -l to view the default.
       Example: To define a machine as an RS/6000®, the keyword would look like:
         ARCH = R6000
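        A job command file can then request machines of this architecture with
        a matching requirements statement, for example:
          requirements = (Arch == "R6000")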
    BG_ALLOW_LL_JOBS_ONLY
       Specifies whether only jobs submitted through LoadLeveler will be
       accepted by the Blue Gene job launcher program.
       Syntax:
       BG_ALLOW_LL_JOBS_ONLY = true | false
       Default value: false
    BG_CACHE_PARTITIONS
       Specifies whether allocated partitions are to be reused for Blue Gene jobs
       whenever possible.
       Syntax:
       BG_CACHE_PARTITIONS = true | false
       Default value: true
    BG_ENABLED
       Specifies whether Blue Gene support is enabled.
       Syntax:
       BG_ENABLED = true | false




If the value of this keyword is true, the central manager will load the Blue
                            Gene control system libraries and query the state of the Blue Gene system so
                            that jobs of type bluegene can be scheduled.
                            Default value: false
                        BG_MIN_PARTITION_SIZE
                           Specifies the smallest number of compute nodes in a partition.
                            Syntax:
                            BG_MIN_PARTITION_SIZE = 32 | 128 | 512 (for Blue Gene/L)

                            BG_MIN_PARTITION_SIZE = 16 | 32 | 64 | 128 | 256 | 512 (for Blue Gene/P)

                            The value for this keyword must not be smaller than the minimum partition
                            size supported by the physical Blue Gene hardware. If the number of compute
                            nodes requested in a job is less than the minimum partition size, LoadLeveler
                            will increase the requested size to the minimum partition size.
                            If the max_psets_per_bp value is set in the DB_PROPERTY file, the value for
                            the BG_MIN_PARTITION_SIZE must be set as described in Table 70:
Table 70. BG_MIN_PARTITION_SIZE values

max_psets_per_bp value in    BG_MIN_PARTITION_SIZE    BG_MIN_PARTITION_SIZE
DB_PROPERTY file             for Blue Gene/L          for Blue Gene/P
4                            >= 128                   >= 128
8                            >= 128                   >= 64
16                           >= 32                    >= 32
32                           >= 32                    >= 16


                            Default value: 32
                        BIN
                           Defines the directory where LoadLeveler binaries are kept.
                            Syntax:
                            BIN = $(RELEASEDIR)/bin
                            Default value: $(tilde)/bin
                        CENTRAL_MANAGER_HEARTBEAT_INTERVAL
                           Specifies, in seconds, how frequently the primary and alternate central
                           managers communicate with each other.
                            Syntax:
                            CENTRAL_MANAGER_HEARTBEAT_INTERVAL = number
                            Default value: The default is 300 seconds (5 minutes).
                        CENTRAL_MANAGER_TIMEOUT
                           Specifies the number of heartbeat intervals that an alternate central manager
                           will wait before declaring that the primary central manager is not operating.
                            Syntax:
                            CENTRAL_MANAGER_TIMEOUT = number
                            Default value: The default is 6.
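                            Example: With the defaults, an alternate central manager declares the
                            primary down after CENTRAL_MANAGER_TIMEOUT *
                            CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 6 * 300 = 1800 seconds (30
                            minutes). The following illustrative values shorten that to
                            4 * 60 = 240 seconds:
                              CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 60
                              CENTRAL_MANAGER_TIMEOUT = 4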
                        CKPT_CLEANUP_INTERVAL
                          Specifies the interval, in seconds, at which the Schedd daemon will run the
                          program specified by the CKPT_CLEANUP_PROGRAM keyword.

Syntax:
   CKPT_CLEANUP_INTERVAL = number

   number must be a positive integer.
   Default value: -1
CKPT_CLEANUP_PROGRAM
  Identifies an administrator-provided program that is to be run at the interval
  specified by the CKPT_CLEANUP_INTERVAL keyword. The intent of this program is
  to delete old checkpoint files created by jobs running under LoadLeveler
  during the checkpoint process.
   Syntax:
   CKPT_CLEANUP_PROGRAM = program

   Where program is the fully qualified name of the program to be run. The
   program must be accessible and executable by LoadLeveler.
   A sample program to remove checkpoint files is provided in the
   /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c file.
   Default value: No default value is set.
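    Example: The following illustrative settings (the program path is
    hypothetical) run a cleanup program once an hour:
      CKPT_CLEANUP_INTERVAL = 3600
      CKPT_CLEANUP_PROGRAM = /u/loadl/bin/rmckptfiles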
CKPT_EXECUTE_DIR
  Specifies the directory where the job step’s executable will be saved for
  checkpointable jobs. You can specify this keyword in either the configuration
  file or the job command file; different file permissions are required depending
  on where this keyword is set. For additional information, see “Planning
  considerations for checkpointing jobs” on page 140.
   Syntax:
   CKPT_EXECUTE_DIR = directory

   This directory cannot be the same as the current location of the executable file,
   or LoadLeveler will not stage the executable. In this case, the user must have
   execute permission for the current executable file.
   Default value: By default, the executable of a checkpointable job step is not
   staged.
CLASS
   Determines whether a machine will accept jobs of a certain job class. For
   parallel jobs, you must define a class instance for each task you want to run on
   a node using one of two formats:
    v The format, CLASS = class_name (count), defines the CLASS names using a
      statement that names the classes and sets the number of tasks for each class
      in parentheses.
     With this format, the following rules apply:
     – Each class can have only one entry
     – If a class has more than one entry or there is a syntax error, the entire
        CLASS statement will be ignored
     – If the CLASS statement has a blank value or is not specified, it will be
        defaulted to No_Class (1)
      – The number of instances for a class specified inside the parentheses ( )
         must be an unsigned integer. If the number specified is 0, it is correct
         syntactically, but the class will not be defined in LoadLeveler
      – If the number of instances for every class in the CLASS statement is 0,
         the default No_Class(1) will be used


v The format, CLASS = { "class1" "class2" "class2" "class2" }, defines the CLASS
                                   names using a statement that names each class and sets the number of tasks
                                   for each class based on the number of times that the class name is used
                                   inside the {} operands.

                                Note: With both formats, the class names list is blank delimited.
                                For a LoadLeveler job to run on a machine, the machine must have a vacancy
                                for the class of that job. If the machine is configured for only one No_Class job
                                and a LoadLeveler job is already running there, then no further LoadLeveler
                                jobs are started on that machine until the current job completes.
|                               You can have a maximum of 1024 characters in the class statement. You cannot
|                               use allclasses or data_stage as a class name, since these are reserved
|                               LoadLeveler keywords.
                                You can assign multiple classes to the same machine by specifying the classes
                                in the LoadLeveler configuration file (called LoadL_config) or in the local
                                configuration file (called LoadL_config.local). The classes, themselves, should
                                be defined in the administration file. See “Setting up a single machine to have
                                multiple job classes” on page 723 and “Defining classes” on page 89 for more
                                information on classes.
                                Syntax:
                                CLASS = { "class_name" ... } | {"No_Class"} | class_name (count) ...
                                Default value: {"No_Class"}
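                                Example: The following two statements are equivalent; each defines two
                                class instances of parallel and one of serial (the class names are
                                illustrative):
                                  CLASS = parallel(2) serial(1)
                                  CLASS = { "parallel" "parallel" "serial" }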
                            CLIENT_TIMEOUT
                               Specifies the maximum time, in seconds, that a daemon waits for a response
                               over TCP/IP from a process. If the waiting time exceeds the specified amount,
                               the daemon tries again to communicate with the process. In general, you
                               should use the default setting unless you are experiencing delays due to an
                               excessively loaded network. If so, you should try increasing this value.
                                Syntax:
                                CLIENT_TIMEOUT = number
                                Default value: The default is 30 seconds.
                            CLUSTER_METRIC
                               Indicates the installation exit to be run by the Schedd to determine where a
                               remote job is distributed. If a remote job is submitted with a list of clusters or
                               the reserved word any and the installation exit is not specified, the remote job
                               is not submitted.
                                Syntax:
                                CLUSTER_METRIC = full_pathname_to_executable

                                The installation exit is run with the following parameters passed as input. All
                                parameters are character strings.
                                v The job ID of the job to be distributed
                                v The number of clusters in the list of clusters
                                v A blank-delimited list of clusters to be considered
                                If the user specifies the reserved word any as the cluster_list during job
                                submission, the job is sent to the first outbound Schedd defined for the
                                first configured remote cluster; if the user specifies a list of clusters,
                                the job is sent to the first outbound Schedd defined for the first specified
                                remote cluster. In either case, the CLUSTER_METRIC is executed on this
                                machine to determine where the job will be distributed. If this machine is
                                not the outbound_hosts Schedd for the assigned cluster, the job will be
                                forwarded to the correct outbound_hosts Schedd.

   Note: The list of clusters may contain a single entry of the reserved word any,
          which indicates that the CLUSTER_METRIC installation exit must
          determine its own list of clusters to select from. This can be all of the
          clusters available using the data access API or a predetermined list set
          by the administrator. If any is specified in place of a cluster list, the
          metric will receive a count of 1 followed by the keyword any.
   The installation exit must write the remote cluster name to which the job is
   submitted as standard output and exit with a value of 0. An exit value of -1
   indicates an error in determining the cluster for distribution and the job is not
    submitted. Returned cluster names that are not valid also cause the job not
    to be submitted. STDERR from the exit is written to the Schedd log.
   LoadLeveler provides a set of sample exits for use in distributing jobs by the
   following metrics:
   v The number of jobs in the idle queue
   v The number of jobs in the specified class
   v The number of free nodes in the cluster
   The installation exit samples are available in the ${RELEASEDIR}/samples/
   llcluster directory.
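    The following minimal C sketch illustrates the exit contract; it is not one
    of the shipped samples, and it assumes that each candidate cluster name is
    passed as a separate argument. It simply selects the first candidate; a
    real exit would rank the candidates, for example with the data access API:

       #include <stdio.h>
       #include <stdlib.h>

       /* argv[1] is the job ID, argv[2] the number of clusters, and
        * argv[3] onward the candidate cluster names. A real exit must
        * also handle a count of 1 followed by the keyword "any" by
        * building its own candidate list. */
       int main(int argc, char *argv[])
       {
           if (argc < 4 || atoi(argv[2]) < 1)
               return -1;            /* error: the job is not submitted */
           printf("%s\n", argv[3]);  /* chosen cluster on standard output */
           return 0;
       }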
CLUSTER_REMOTE_JOB_FILTER
   Indicates the installation exit to be run by the inbound Schedd for each remote
   job request to filter the user’s job command file statements during submission
   or move job. If the keyword is not specified, no job filtering is done.
   Syntax:
   CLUSTER_REMOTE_JOB_FILTER = full_pathname_to_executable

   The installation exit is run with the submitting user’s ID. All parameters are
   character strings.
   This installation exit is executed on the inbound_hosts of the local cluster
   when receiving a job submission or move job request.
   The executable specified is called with the submitting user’s unfiltered job
   command file statements as the standard input. The standard output is
   submitted to LoadLeveler. If the exit returns with a nonzero exit code, the
   remote job submission or job move will fail. A submit filter can only make
   changes to LoadLeveler job command file statements.
   The data access API can be used by the remote job filter to query the Schedd
   for the job object received from the sending cluster.
   If the local submission filter on the submitting cluster has added or deleted
   steps from the original user’s job command file, the remote job filter must add
   or delete the same number of steps. The job command file statements returned
   by the remote job filter must contain the same number of steps as the job
   object received from the sending cluster.
   Changes to the following job command file keyword statements are ignored:
   v executable

v   environment
                                v   image_size
                                v   cluster_input_file
                                v   cluster_output_file
                                v   cluster_list
                                The following job command file keyword will have different behavior:
                                v initialdir – If not set by the remote job filter or the submitting user’s
                                  unfiltered job command file, the default value will remain the current
                                  working directory at the time the job was submitted. Access to the initialdir
                                  will be verified on the cluster selected to run the job. If access to initialdir
                                  fails, the submission or move job will fail.
|                               When you distribute a scale-across job to other clusters for scheduling and a
|                               remote job filter is configured, the filter will be applied to the distributed job.
|                               However, only changes to the following job command file keyword statements
|                               will be accepted. Changes to any other statement by the remote job filter will
|                               be ignored.
|                               v #@ class
|                               v #@ priority
|                               v #@ as_limit
|                               v #@ core_limit
|                               v #@ cpu_limit
|                               v #@ data_limit
|                               v #@ file_limit
|                               v #@ job_cpu_limit
|                               v #@ locks_limit
|                               v #@ memlock_limit
|                               v #@ nofile_limit
|                               v #@ nproc_limit
|                               v #@ rss_limit
|                               v #@ stack_limit
                                To maintain compatibility between the SUBMIT_FILTER and
                                CLUSTER_REMOTE_JOB_FILTER programs, the following environment
                                variables are set when either exit is invoked:
                                v LOADL_ACTIVE – the LoadLeveler version.
                                v LOADL_STEP_COMMAND – the location of the job command file passed
                                  as input to the program. This job command file only contains LoadLeveler
                                  keywords.
                                v LOADL_STEP_ID – The job identifier, generated by the submitting
                                  LoadLeveler cluster.

                                  Note: The environment variable name is LOADL_STEP_ID although the
                                        value it contains is a "job" identifier. This name is used to be
                                        compatible with the local job filter interface.
                                v LOADL_STEP_OWNER – The owner (UNIX user name) of the job.
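                                As an illustration of the contract only, the following minimal C sketch is
                                a valid pass-through remote job filter. Because it echoes standard input to
                                standard output unchanged, it adds and deletes no steps, so the step count
                                is preserved as required; a real filter would edit individual keyword
                                statements before printing them:

                                   #include <stdio.h>

                                   int main(void)
                                   {
                                       int c;
                                       /* Unfiltered job command file statements arrive on standard
                                        * input; whatever is written to standard output is submitted. */
                                       while ((c = getchar()) != EOF)
                                           putchar(c);
                                       return 0;  /* nonzero makes the remote submission fail */
                                   }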
                            CLUSTER_USER_MAPPER
                               Indicates the installation exit to be run by the inbound Schedd for each remote




job request to determine the user mapping of the cluster. This keyword implies
   that user mapping is performed. If the keyword is not specified, no user
   mapping is done.
   Syntax:
   CLUSTER_USER_MAPPER = full_pathname_to_executable

   The installation exit is run with the following parameters passed as input. All
   parameters are character strings.
   v The user name to be mapped
    v The name of the cluster from which the user originated
   This installation exit is executed on the inbound_hosts of the local cluster
   when receiving a job submission, move job request or remote command.
   The installation exit must write the new user name as standard output and exit
   with a value of 0. An exit value of -1 indicates an error and the job is not
   submitted. STDERR from the exit is written to the Schedd log. An exit value of
   1 indicates that the user name returned for this job was not mapped.
CM_CHECK_USERID
   Specifies whether the central manager will check that user IDs that send
   requests through a command or API exist on the central manager machine.
   Syntax:
   CM_CHECK_USERID = true | false
   Default value: true
CM_COLLECTOR_PORT
   Specifies the port number used when connecting to the central manager's
   collector.
   Syntax:
   CM_COLLECTOR_PORT = port number
   Default value: The default is 9612.
COMM
  Specifies a local directory where LoadLeveler keeps special files used for UNIX
  domain sockets for communicating among LoadLeveler daemons running on
   the same machine. This keyword allows the administrator to choose a file
   system other than /tmp for these files. If you change the COMM option
   you must stop and then restart LoadLeveler using the llctl command.
   Syntax:
   COMM = local directory
   Default value: The default location for the files is /tmp.
CONTINUE
  Determines whether suspended jobs should continue execution.
   Syntax:
   CONTINUE: expression that evaluates to T or F (true or false)

   When T, suspended LoadLeveler jobs resume execution on the machine.

   Default value: No default value is set.
   For information about time-related variables that you may use for this
   keyword, see “Variables to use for setting times” on page 320.
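    Example: The following illustrative expression (the hours chosen are
    arbitrary) resumes suspended jobs outside of prime shift:
      CONTINUE: (tm_hour < 8) || (tm_hour >= 17)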


CUSTOM_METRIC
                          Specifies a machine’s relative priority to run jobs.
                            Syntax:
                            CUSTOM_METRIC = number

                             This is an arbitrary number that you can use in the MACHPRIO expression.
                            Negative values are not allowed.
                            Default value: If you specify neither CUSTOM_METRIC nor
                            CUSTOM_METRIC_COMMAND, CUSTOM_METRIC = 1 is assumed. For
                            more information, see “Setting negotiator characteristics and policies” on page
                            45.
                            For more information related to using this keyword, see “Defining a
                            LoadLeveler cluster” on page 44.
                        CUSTOM_METRIC_COMMAND
                           Specifies an executable and any required arguments. The exit code of this
                           command is assigned to CUSTOM_METRIC. If this command does not exit
                           normally, CUSTOM_METRIC is assigned a value of 1. This command is
                           forked every POLLING_FREQUENCY * POLLS_PER_UPDATE seconds.
                            Syntax:
                            CUSTOM_METRIC_COMMAND = command
                            Default value: No default is set; LoadLeveler does not run any command to
                            determine CUSTOM_METRIC.
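                            Example: In the following illustrative configuration (the script path is
                            hypothetical), the exit code of a site-written script becomes the
                            machine's CUSTOM_METRIC, which the MACHPRIO expression then uses to rank
                            machines:
                              CUSTOM_METRIC_COMMAND = /u/loadl/bin/rate_machine
                              MACHPRIO = $(CUSTOM_METRIC)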
                        DCE_AUTHENTICATION_PAIR
                          Specifies a pair of installation supplied programs that are used to authenticate
                          DCE security credentials.
                            Restriction: DCE security is not supported by LoadLeveler for Linux.
                            Syntax:
                            DCE_AUTHENTICATION_PAIR = program1, program2

                            Where program1 and program2 are LoadLeveler- or installation-supplied
                            programs that are used to authenticate DCE security credentials. program1
                            obtains a handle (an opaque credentials object), at the time the job is
                            submitted, which is used to authenticate to DCE. program2 uses the handle
                            obtained by program1 to authenticate to DCE before starting the job on the
                            executing machines.
                            Default value: See “Handling DCE security credentials” on page 74 for
                            information about defaults.
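                            Example: Assuming the LoadLeveler-supplied DCE programs are used, a
                            typical pairing looks like the following:
                              DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce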
                        DEFAULT_PREEMPT_METHOD
                           Specifies the default preemption method for LoadLeveler to use when a
                           preempt method is not specified in a PREEMPT_CLASS statement or in the
                           llpreempt command. LoadLeveler also uses this default preemption method to
                           preempt job steps that are running on reserved machines when a reservation
                           period begins.
                            Restrictions:
                            v This keyword is valid only for the BACKFILL scheduler.
                            v The suspend method of preemption (the default) might not be supported on
                              your level of Linux. If you want to preempt jobs that are running where
                              process tracking is not supported, you must use this keyword to specify a
                              method other than suspend.

Syntax:
        DEFAULT_PREEMPT_METHOD = rm | sh | su | vc | uh

       Valid values are:
       rm
           LoadLeveler preempts the jobs and removes them from the job queue. To
           rerun the job, the user must resubmit the job to LoadLeveler.
       sh LoadLeveler ends the jobs and puts them into System Hold state. They
           remain in that state on the job queue until an administrator releases them.
           After being released, the jobs go into Idle state and will be rescheduled to
           run as soon as resources for the job are available.
       su LoadLeveler suspends the jobs and puts them in Preempted state. They
           remain in that state on the job queue until the preempting job has
           terminated, and resources are available to resume the preempted job on the
           same set of nodes. To use this value, process tracking must be enabled.
       vc LoadLeveler ends the jobs and puts them in Vacate state. They remain in
           that state on the job queue and will be rescheduled to run as soon as
           resources for the job are available.
       uh LoadLeveler ends the jobs and puts them into User Hold state. They
           remain in that state on the job queue until an administrator releases them.
           After being released, the jobs go into Idle state and will be rescheduled to
           run as soon as resources for the job are available.
       Default value: su (suspend method)
       For more information related to using this keyword, see “Steps for configuring
       a scheduler to preempt jobs” on page 130.
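        Example: On a system where process tracking is not enabled, one
        illustrative alternative to the suspend default is to vacate preempted
        jobs:
          DEFAULT_PREEMPT_METHOD = vc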
    DRAIN_ON_SWITCH_TABLE_ERROR
      Specifies whether the startd should be drained when the switch table fails to
      unload. This will flag the administrator that intervention may be required to
      unload the switch table. When DRAIN_ON_SWITCH_TABLE_ERROR is set
      to true, the startd will be drained when the switch table fails to unload.
       Syntax:
       DRAIN_ON_SWITCH_TABLE_ERROR = true | false
       Default value: false
|   DSTG_MAX_STARTERS
|      Specifies a machine-specific limit on the number of data staging initiators.
|      Since each task of a data staging job step consumes one initiator from the
|      data_stage class on the specified machine, DSTG_MAX_STARTERS provides
|      the maximum number of data staging tasks that can run at the same time on
|      the machine.
|      Syntax:
|      DSTG_MAX_STARTERS = number

|      Notes:
|      1. If you have not set the DSTG_MAX_STARTERS value in either the
|         global or local configuration files, there will not be any data
|         staging initiators on the specified machine. In this configuration,
|         the compute node will not be allowed to perform data staging tasks.
|      2. The value specified for DSTG_MAX_STARTERS will be the number of
|         initiators available for the built-in data_stage class on that
|         machine.
|      3. The value specified for MAX_STARTERS will not limit the value
|         specified for DSTG_MAX_STARTERS.
|                               Default value: 0
|                           DSTG_MIN_SCHEDULING_INTERVAL
|                              Specifies a minimum interval between scheduling inbound data staging job
|                              steps when they cannot be scheduled immediately. With a workload that
|                              involves a lot of data staging jobs, this keyword can be adjusted down from
|                              the default value of 900 seconds if data staging jobs remain idle when there
|                              are data staging resources available. Setting this keyword to a smaller interval
|                              may impact scheduler performance when there is contention for data staging
|                              resources and a large number of idle jobs in the queue.
|                               Syntax:
|                               DSTG_MIN_SCHEDULING_INTERVAL = seconds

|                               Notes:
|                                         1. You can only specify this keyword in the global configuration file; it
|                                            will be ignored in local configuration files.
|                                         2. LoadLeveler ignores DSTG_MIN_SCHEDULING_INTERVAL
|                                            when DSTG_TIME=AT_SUBMIT.
|                               Default value: 900 seconds
|                           DSTG_TIME
|               Specifies when LoadLeveler schedules data staging job steps:
|                               AT_SUBMIT
|                                     LoadLeveler can schedule data staging steps any time after a job
|                                     requiring data staging has been submitted.
|                               JUST_IN_TIME
|                                     LoadLeveler must schedule data staging job steps as close as possible
|                                     to the application job steps that were submitted in the same job.
|                               Syntax:
|                               DSTG_TIME = AT_SUBMIT | JUST_IN_TIME

|                               Note: You can only specify the DSTG_TIME keyword in the global
|                                     configuration file. Any value specified for this keyword in local
|                                     configuration files will be ignored.
|                               Default value: AT_SUBMIT
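                                Example: The following illustrative combination allows two concurrent
                                data staging tasks on a machine and schedules inbound data staging as
                                close as possible to the application job steps:
                                  DSTG_MAX_STARTERS = 2
                                  DSTG_TIME = JUST_IN_TIME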
                            ENFORCE_RESOURCE_MEMORY
                               Specifies whether the AIX Workload Manager is configured to limit, as
                               precisely as possible, the real memory usage of a WLM class. For this keyword
                               to be valid, ConsumableMemory must be set through the
                               ENFORCE_RESOURCE_USAGE keyword.
                                Syntax:
                                ENFORCE_RESOURCE_MEMORY = true | false
                                Default value: false
                            ENFORCE_RESOURCE_POLICY
                               Specifies what type of resource entitlements will be assigned to the AIX
                               Workload Manager classes. Valid values are:
                               shares
                                   A share value is assigned to the class based on the job step's
                                   requested resources (one unit of resource equals one share). This is
                                   the default policy.
                               soft
                                   A percentage value is assigned to the class based on the job step's
                                   requested resources and the total machine resources. This percentage
                                   can be exceeded if there is no contention for the resource.
                               hard
                                   A percentage value is assigned to the class based on the job step's
                                   requested resources and the total machine resources. This percentage
                                   cannot be exceeded regardless of the contention for the resource.
                               This keyword is only valid for CPU and real memory with either shares
                               or percent limits. If desired, this keyword can be used in the
                               LoadL_config.local file to set up a different policy for each machine.
                               The ENFORCE_RESOURCE_USAGE keyword must be set for this keyword to be
                               valid.
                                Syntax:
                                ENFORCE_RESOURCE_POLICY = hard | soft | shares
                                Default value: shares
    ENFORCE_RESOURCE_SUBMISSION
       Indicates whether jobs submitted should be checked for the resources and
       node_resources keywords. If the value specified is true, LoadLeveler will
       check all jobs at submission time for the resources and node_resources
       keywords. The job command file resources and node_resources keywords
       combined need to have at least the resources specified in the
       ENFORCE_RESOURCE_USAGE keyword in order for the job to be submitted
       successfully. When RSET_MCM_AFFINITY is enabled, the task_affinity or
       parallel_threads keyword can be used instead of the resources and
       node_resources keywords when the resource being enforced is
       ConsumableCpus.
       If the value specified is false, no checking will be done and jobs submitted
       without the resources or node_resources keywords will not have resources
       enforced. In this instance, those jobs might interfere with other jobs whose
       resources are enforced.
       Syntax:
       ENFORCE_RESOURCE_SUBMISSION = true | false
       Default value: false
    ENFORCE_RESOURCE_USAGE
|      Specifies whether the AIX Workload Manager is used to enforce CPU and
|      memory resources. This keyword accepts either a value of deactivate or a list
|      of one or more of the following predefined resources:
       v ConsumableCpus
       v ConsumableMemory
|      v ConsumableVirtualMemory
|      v ConsumableLargePageMemory
        Either memory or CPUs or both can be enforced, but the resources must also be
       specified on the SCHEDULE_BY_RESOURCES keyword. If deactivate is
       specified, LoadLeveler will deactivate AIX Workload Manager on all the nodes
       in the LoadLeveler cluster.

       Restriction: WLM enforcement is ignored by LoadLeveler for Linux.
       Syntax:
|      ENFORCE_RESOURCE_USAGE = name name ... name | deactivate
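        Example: The following illustrative configuration enforces CPU and real
        memory through AIX Workload Manager; note that the same resources must
        also appear on the SCHEDULE_BY_RESOURCES keyword:
          SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory
          ENFORCE_RESOURCE_USAGE = ConsumableCpus ConsumableMemory
          ENFORCE_RESOURCE_MEMORY = true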




EXECUTE
                           Specifies the local directory to store the executables of jobs submitted by other
                           machines.
                            Syntax:
                            EXECUTE = local directory/execute
                            Default value: $(tilde)/execute
                        FAIR_SHARE_INTERVAL
                           Specifies, in units of hours, the time interval it takes for resource usage in fair
                           share scheduling to decay to 5% of its initial value. Historic fair share data
                           collected before the most recent time interval of this length will have little
                           impact on fair share scheduling.
                            Syntax:
                            FAIR_SHARE_INTERVAL = hours
                            Default value: The default value is 168 hours (one week). If a negative value
                            or 0 is specified, the default value is used.
                        FAIR_SHARE_TOTAL_SHARES
                           Specifies the total number of shares that the cluster CPU or Blue Gene
                           resources are divided into. If this value is less than or equal to 0, fair share
                           scheduling is turned off.
                            Syntax:
                            FAIR_SHARE_TOTAL_SHARES = shares
                            Default value: The default value is 0.
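                            Example: The following illustrative values turn on fair share scheduling
                            with 100 total shares and a two-day decay interval:
                              FAIR_SHARE_TOTAL_SHARES = 100
                              FAIR_SHARE_INTERVAL = 48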
                        FEATURE
                           Specifies an optional characteristic to use to match jobs with machines. You can
                           specify unique characteristics for any machine using this keyword. When
                           evaluating job submissions, LoadLeveler compares any required features
                           specified in the job command file to those specified using this keyword. You
                           can have a maximum of 1024 characters in the feature statement.
                            Syntax:
                            Feature = {"string" ...}
                            Default value: No default value is set.
                            Example: If a machine has licenses for the installed products ABC and
                            XYZ, you can enter the following in the local configuration file:
                            Feature = {"abc" "xyz"}
                            When submitting a job that requires both of these products, you should enter
                            the following in your job command file:
                            requirements = (Feature == "abc") && (Feature == "xyz")

                            Note: You must define a feature on all machines that will be able to run
                                  dynamic simultaneous multithreading (SMT). SMT is only supported on
                                  POWER6 and POWER5 processor-based systems.
                            Example: When submitting a job that requires the SMT function, first
                            specify smt = yes in the job command file (or select a class that has
                            smt = yes defined). Next, specify node_usage = not_shared and, last,
                            enter the following in the job command file:
                            requirements = (Feature == "smt")




FLOATING_RESOURCES
       Specifies which consumable resources are available collectively on all of the
       machines in the LoadLeveler cluster. The count for each resource must be an
       integer greater than or equal to zero, and each resource can only be specified
       once in the list. Any resource specified for this keyword that is not already
       listed in the SCHEDULE_BY_RESOURCES keyword will not affect job
       scheduling. If any resource is specified incorrectly with the
       FLOATING_RESOURCES keyword, then all floating resources will be
       ignored. ConsumableCpus, ConsumableMemory,
|      ConsumableVirtualMemory, and ConsumableLargePageMemory may not be
       specified as floating resources.
        Syntax:
        FLOATING_RESOURCES = name(count) name(count) ... name(count)
        Default value: No default value is set.
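         Example: The following illustrative statement (the resource names are
         hypothetical) defines 10 floating licenses of one product and 2 of
         another; both names must also appear on the SCHEDULE_BY_RESOURCES
         keyword to affect scheduling:
           FLOATING_RESOURCES = spice2g6(10) matlab(2)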
    FS_INTERVAL
       Defines the number of minutes used as the interval for checking free file
        system space or inodes. If your file system receives many log messages or
        copies large executables to the LoadLeveler spool, the file system will fill up
        more quickly and you should perform file-system checking more frequently by
        setting the interval to a smaller value. LoadLeveler will not check the file system if the
       value of FS_INTERVAL is:
       v Set to zero
       v Set to a negative integer
        Syntax:
        FS_INTERVAL = minutes
        Default value: If FS_INTERVAL is not specified but any of the other
        file-system keywords (FS_NOTIFY, FS_SUSPEND, FS_TERMINATE,
        INODE_NOTIFY, INODE_SUSPEND, INODE_TERMINATE) are specified, the
        FS_INTERVAL value will default to 5 and the file system will be checked. If no
        file-system or inode keywords are set, LoadLeveler does not monitor file
        systems at all.
        For more information related to using this keyword, see “Setting up file system
        monitoring” on page 54.
    FS_NOTIFY
       Defines the lower and upper amounts, in bytes, of free file-system space at
       which LoadLeveler is to notify the administrator:
       v If the amount of free space becomes less than the lower threshold value,
         LoadLeveler sends a mail message to the administrator indicating that
         logging problems may occur.
        v When the amount of free space becomes greater than the upper threshold
          value, LoadLeveler sends a mail message to the administrator indicating that
          the problem has been resolved.
        Syntax:
        FS_NOTIFY = lower threshold, upper threshold

         Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
         precede the B. Valid values for both the lower and upper thresholds are -1B
         and all positive integers. If the value is set to -1, the transition across the
         threshold is not checked.
        Default value: In bytes: 1KB, -1B


For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
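                            Example: With the following illustrative thresholds, the administrator is
                            notified when free space falls below 5 MB and again when it recovers
                            above 10 MB:
                              FS_NOTIFY = 5MB, 10MB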
                        FS_SUSPEND
                           Defines the lower and upper amounts, in bytes, of free file system space at
                           which LoadLeveler drains and resumes the Schedd and startd daemons
                           running on a node.
                           v If the amount of free space becomes less than the lower threshold value,
                             then LoadLeveler drains the Schedd and the startd daemons if they are
                             running on a node. When this happens, logging is turned off and mail
                             notification is sent to the administrator.
                           v When the amount of free space becomes greater than the upper threshold
                             value, LoadLeveler signals the Schedd and the startd daemons to resume.
                             When this happens, logging is turned on and mail notification is sent to the
                             administrator.
                            Syntax:
                            FS_SUSPEND = lower threshold, upper threshold

                            Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
                            precede the B. Valid values for both the lower and upper thresholds are -1B
                            and all positive integers. If the value is set to -1, the transition across the
                            threshold is not checked.
                            Default value: In bytes: -1B, -1B
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        FS_TERMINATE
                           Defines the lower and upper amounts, in bytes, of free file system space at
                           which LoadLeveler is terminated. This keyword sends the SIGTERM signal to
                           the Master daemon which then terminates all LoadLeveler daemons running
                           on the node.
                           v If the amount of free space becomes less than the lower threshold value, all
                             LoadLeveler daemons are terminated.
                           v An upper threshold value is required for this keyword. However, since
                             LoadLeveler has been terminated at the lower threshold, no action occurs.
                            Syntax:
                            FS_TERMINATE = lower threshold, upper threshold

                            Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
                            precede the B. Valid values for the lower threshold are -1B and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In bytes: -1B, -1B
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        GLOBAL_HISTORY
                          Identifies the directory that will contain the global history files produced
                          by the llacctmrg command when no directory is specified as a command
                          argument.
                            Syntax:
                            GLOBAL_HISTORY = directory
                            Default value: The default value is $(SPOOL) (the local spool directory).


For more information related to using this keyword, see “Collecting the
   accounting information and storing it into files” on page 66.
GSMONITOR
  Location of the gsmonitor executable (LoadL_GSmonitor).
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR = directory
   Default value: $(BIN)/LoadL_GSmonitor
GSMONITOR_COREDUMP_DIR
  Local directory for storing LoadL_GSmonitor core dump files.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
GSMONITOR_DOMAIN
  Specifies the peer domain on which the GSMONITOR daemon will execute.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_DOMAIN = PEER
   Default value: No default value is set.
   For more information related to using this keyword, see “The gsmonitor
   daemon” on page 14.
GSMONITOR_RUNS_HERE
  Specifies whether the gsmonitor daemon will run on the host.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_RUNS_HERE = TRUE | FALSE
   Default value: FALSE
   For more information related to using this keyword, see “The gsmonitor
   daemon” on page 14.
HISTORY
   Defines the path name where a file containing the history of local LoadLeveler
   jobs is kept.
   Syntax:
   HISTORY = directory
   Default value: $(SPOOL)/history
   For more information related to using this keyword, see “Collecting the
   accounting information and storing it into files” on page 66.
HISTORY_PERMISSION
   Specifies the owner, group, and world permissions of the history file associated
   with a LoadL_schedd daemon.

Syntax:
                            HISTORY_PERMISSION = permissions | rw-rw----

                            permissions must be a string of nine characters, each of which is one of
                            the characters r, w, x, or -.
                            Default value: The default settings are 660 (rw-rw----). LoadL_schedd will use
                            the default setting if the specified permissions are less than rw-------.
                            Example: A specification such as HISTORY_PERMISSION = rw-rw-r-- will result
                            in permission settings of 664.
                        INODE_NOTIFY
                           Defines the lower and upper amounts, in inodes, of free file-system inodes at
                           which LoadLeveler is to notify the administrator:
                           v If the number of free inodes becomes less than the lower threshold value,
                             LoadLeveler sends a mail message to the administrator indicating that
                             logging problems may occur.
                           v When the number of free inodes becomes greater than the upper threshold
                             value, LoadLeveler sends a mail message to the administrator indicating that
                             the problem has been resolved.
                            Syntax:
                            INODE_NOTIFY = lower threshold, upper threshold

                            Valid values for both the lower and upper thresholds are -1 and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In inodes: 1000, -1
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        INODE_SUSPEND
                           Defines the lower and upper amounts, in inodes, of free file system inodes at
                           which LoadLeveler drains and resumes the Schedd and startd daemons
                           running on a node.
                           v If the number of free inodes becomes less than the lower threshold value,
                             then LoadLeveler drains the Schedd and the startd daemons if they are
                             running on a node. When this happens, logging is turned off and mail
                             notification is sent to the administrator.
                           v When the number of free inodes becomes greater than the upper threshold
                             value, LoadLeveler signals the Schedd and the startd daemons to resume.
                             When this happens, logging is turned on and mail notification is sent to the
                             administrator.
                            Syntax:
                            INODE_SUSPEND = lower threshold, upper threshold

                            Valid values for both the lower and upper thresholds are -1 and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In inodes: -1, -1
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        INODE_TERMINATE
                           Defines the lower and upper amounts, in inodes, of free file system inodes at

which LoadLeveler is terminated. This keyword sends the SIGTERM signal to
   the Master daemon which then terminates all LoadLeveler daemons running
   on the node.
   v If the number of free inodes becomes less than the lower threshold value, all
     LoadLeveler daemons are terminated.
   v An upper threshold value is required for this keyword. However, since
     LoadLeveler has been terminated at the lower threshold, no action occurs.
   Syntax:
   INODE_TERMINATE = lower threshold, upper threshold

    Valid values for the lower threshold are -1 and all positive integers. If the
    value is set to -1, the transition across the threshold is not checked.
   Default value: In inodes: -1, -1
   For more information related to using this keyword, see “Setting up file system
   monitoring” on page 54.
JOB_ACCT_Q_POLICY
   Specifies, in seconds, how often the startd daemon updates the Schedd
   daemon with accounting data for running jobs. This controls the accuracy
   of the llq -x command.
   Syntax:
   JOB_ACCT_Q_POLICY = number
   Default value: 300 seconds
   For more information related to using this keyword, see “Gathering job
   accounting data” on page 61.
JOB_EPILOG
   Path name of the epilog program.
   Syntax:
   JOB_EPILOG = program name
   Default value: No default value is set.
   For more information related to using this keyword, see “Writing prolog and
   epilog programs” on page 77.
JOB_LIMIT_POLICY
   Specifies the interval, in seconds, at which LoadLeveler checks whether
   job_cpu_limit has been exceeded. The smaller of JOB_LIMIT_POLICY and
   JOB_ACCT_Q_POLICY is used to control how often the startd daemon
   collects resource consumption data on running jobs, and how often the
   job_cpu_limit is checked.
   Syntax:
   JOB_LIMIT_POLICY = number
   Default value: The default for JOB_LIMIT_POLICY is
   POLLING_FREQUENCY multiplied by POLLS_PER_UPDATE.
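   Example: With the illustrative values below, the startd daemon collects
   resource consumption data and checks job_cpu_limit every 60 seconds,
   because JOB_LIMIT_POLICY is the smaller of the two values:
     JOB_ACCT_Q_POLICY = 300
     JOB_LIMIT_POLICY = 60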
JOB_PROLOG
   Path name of the prolog program.
   Syntax:
   JOB_PROLOG = program name
   Default value: No default value is set.


For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        JOB_USER_EPILOG
                           Path name of the user epilog program.
                              Syntax:
                              JOB_USER_EPILOG = program name
                              Default value: No default value is set.
                              For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        JOB_USER_PROLOG
                           Path name of the user prolog program.
                              Syntax:
                              JOB_USER_PROLOG = program name
                              Default value: No default value is set.
                              For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        KBDD
                          Location of kbdd executable (LoadL_kbdd).
                              Syntax:
                              KBDD = directory
                              Default value: $(BIN)/LoadL_kbdd
                        KBDD_COREDUMP_DIR
                          Local directory for storing LoadL_kbdd daemon core dump files.
                              Syntax:
                              KBDD_COREDUMP_DIR = directory
                              Default value: The /tmp directory.
                              For more information related to using this keyword, see “Specifying file and
                              directory locations” on page 47.
                        KILL
                           Determines whether vacated jobs should be sent the SIGKILL signal and
                           replaced in the queue. It is used to remove a job that is taking too long
                           to vacate.
                              Syntax:
                              KILL: expression that evaluates to T or F (true or false)

                              When T, vacated LoadLeveler jobs are removed from the machine with no
                              attempt to take checkpoints.

                              For information about time-related variables that you may use for this
                              keyword, see “Variables to use for setting times” on page 320.
                        LIB
                              Defines the directory where LoadLeveler libraries are kept.
                              Syntax:
                              LIB = directory
                              Default value: $(RELEASEDIR)/lib

LL_RSH_COMMAND
   Specifies an administrator-provided executable to be used by llctl start
   when starting LoadLeveler on remote machines listed in the administration
   file. The LL_RSH_COMMAND keyword is any executable that can be used as
   a substitute for /usr/bin/rsh. The llctl start command passes arguments to
   the executable specified by LL_RSH_COMMAND in the following format:
   LL_RSH_COMMAND hostname -n llctl start options

   Syntax:
   LL_RSH_COMMAND = full_path_to_executable
    Default value: /usr/bin/rsh. This keyword must specify the full path name of
    the executable provided. If no value is specified, LoadLeveler will use
    /usr/bin/rsh as the default when issuing a start. If an error occurs while
    locating the specified executable, an error message is displayed.
   Example: This example shows that using the secure shell (/usr/bin/ssh) is the
   preferred method for the llctl start command to communicate with remote
   nodes. Specify the following in the configuration file:
   LL_RSH_COMMAND=/usr/bin/ssh
LOADL_ADMIN
  Specifies a list of LoadLeveler administrators.
   Syntax:
   LOADL_ADMIN = list of user names

   Where list of user names is a blank-delimited list of those individuals who will
   have administrative authority. These users are able to invoke the
   administrator-only commands such as llctl, llfavorjob, and llfavoruser. These
   administrators can also invoke the administrator-only GUI functions. For more
   information, see Chapter 7, “Using LoadLeveler’s GUI to perform
   administrator tasks,” on page 169.
   Default value: No default value is set, which means no one has administrator
   authority until this keyword is defined with one or more user names.
   Example: To grant administrative authority to users bob and mary, enter the
   following in the configuration file:
   LOADL_ADMIN = bob mary
   For more information related to using this keyword, see “Defining LoadLeveler
   administrators” on page 43.
LOCAL_CONFIG
  Specifies the path name of the optional local configuration file containing
  information specific to a node in the LoadLeveler network.
   Syntax:
    LOCAL_CONFIG = file name
   Default value: No default value is set.
   Examples:
   v If you are using a distributed file system like NFS, some examples are:
      LOCAL_CONFIG = $(tilde)/$(host).LoadL_config.local
      LOCAL_CONFIG = $(tilde)/LoadL_config.$(host).$(domain)
      LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)




See “LoadLeveler variables” on page 314 for information about the tilde,
                                  host, and domain variables.
                                v If you are using a local file system, an example is:
                                   LOCAL_CONFIG = /var/LoadL/LoadL_config.local
                            LOG
                              Defines the local directory to store log files. It is not necessary to keep all the
                              log files created by the various LoadLeveler daemons and programs in one
                              directory, but you will probably find it convenient to do so.
                                Syntax:
                                LOG = local directory/log
                                Default value: $(tilde)/log
                            LOG_MESSAGE_THRESHOLD
                              Specifies the maximum amount of memory, in bytes, for the message queue.
                              Messages in the queue are waiting to be written to the log file. When the
                              message logging thread cannot write messages to the log file as fast as they
                              arrive, the memory consumed by the message queue can exceed the threshold.
                               In this case, LoadLeveler curtails logging by turning off all debug flags
                               except D_ALWAYS, thereby reducing the amount of logging that takes place.
                               If the curtailed message queue still exceeds the threshold, message logging
                               is stopped. Special log messages are written to the log file to indicate that
                               some messages are missing, and mail is also sent to the administrator. A
                               value of -1 for this keyword turns off the buffer threshold, meaning that the
                               threshold is unlimited.
                                Syntax:
                                LOG_MESSAGE_THRESHOLD = bytes
                                Default value: 20*1024*1024 (bytes)
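                                 Example: As an illustration, to allow the message queue to grow to 50
                                 megabytes before logging is curtailed, you could specify:
                                 LOG_MESSAGE_THRESHOLD = 50*1024*1024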
                            MACHINE_AUTHENTICATE
                              Specifies whether machine validation is performed. When set to true,
                              LoadLeveler only accepts connections from machines specified in the
                              administration file. When set to false, LoadLeveler accepts connections from
                              any machine.
                                When set to true, every communication between LoadLeveler processes will
                                verify that the sending process is running on a machine that is identified by
                                a machine stanza in the administration file. The validation is done by
                                capturing the address of the sending machine when the accept function call is
                                issued to accept a connection. The gethostbyaddr function is called to translate
                                the address to a name, and the name is matched with the list derived from the
                                administration file.

|                               Note: You must not set the MACHINE_AUTHENTICATE keyword to true for
|                                     a cluster which is configured to be a main scale-across cluster. The main
|                                     scale-across cluster must permit communication with LoadLeveler
|                                     daemons running on any machine in any cluster participating in the
|                                     scale-across multicluster environment.
                                Syntax:
                                MACHINE_AUTHENTICATE = true | false
                                Default value: false
                                For more information related to using this keyword, see “Defining a
                                LoadLeveler cluster” on page 44.


MACHINE_UPDATE_INTERVAL
      Specifies the time, in seconds, during which machines must report to the
      central manager.
       Syntax:
       MACHINE_UPDATE_INTERVAL = number

       Where number specifies the time period, in seconds, during which machines
       must report to the central manager. Machines that do not report in this number
       of seconds are considered down. number must be a numerical value and cannot
       be an arithmetic expression.
       Default value: The default is 300 seconds.
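        Example: As an illustration, with the following setting, a machine that has
        not reported to the central manager for 10 minutes is considered down:
        MACHINE_UPDATE_INTERVAL = 600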
       For more information related to using this keyword, see “Setting negotiator
       characteristics and policies” on page 45.
    MACHPRIO
      Machine priority expression.
       Syntax:
       MACHPRIO = expression

       You can use the following LoadLeveler variables in the MACHPRIO
       expression:
       v LoadAvg
       v Connectivity
       v Cpus
       v Speed
       v Memory
       v VirtualMemory
       v Disk
       v CustomMetric
       v MasterMachPriority
       v ConsumableCpus
       v ConsumableMemory
       v ConsumableVirtualMemory
|      v ConsumableLargePageMemory
       v PagesFreed
       v PagesScanned
       v FreeRealMemory
       For detailed descriptions of these variables, see “LoadLeveler variables” on
       page 314.
       Default value: (0 - LoadAvg)
       Examples:
       v Example 1
         This example orders machines by the Berkeley one-minute load average.
          MACHPRIO : 0 - (LoadAvg)

          Therefore, if LoadAvg equals .7, this example would read:
          MACHPRIO : 0 - (.7)

          The MACHPRIO would evaluate to -.7.
       v Example 2



This example orders machines by the Berkeley one-minute load average
                               normalized for machine speed:
                               MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))

                               Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this
                               example would read:
                               MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))

                               This example further evaluates to:
                               MACHPRIO : 0 - (350)

                               The MACHPRIO would evaluate to -350.
                               Notice that if the speed of the machine were increased to 3, the equation
                               would read:
                               MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))

                              The MACHPRIO would evaluate to approximately -233. Therefore, as the
                              speed of the machine increases, the MACHPRIO also increases.
                            v Example 3
                              This example orders machines accounting for real memory and available
                              swap space (remembering that Memory is in Mbytes and VirtualMemory is
                              in Kbytes):
                               MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) +
                               (10 * Memory) + (VirtualMemory / 1000)
                            v Example 4
                              This example sets a relative machine priority based on the value of the
                              CUSTOM_METRIC keyword.
                               MACHPRIO : CustomMetric
                               To do this, you must specify a value for the CUSTOM_METRIC keyword or
                               the CUSTOM_METRIC_COMMAND keyword in either the
                               LoadL_config.local file of a machine or in the global LoadL_config file. To
                               assign the same relative priority to all machines, specify the
                               CUSTOM_METRIC keyword in the global configuration file. For example:
                               CUSTOM_METRIC = 5
                              You can override this value for an individual machine by specifying a
                              different value in that machine’s LoadL_config.local file.
                            v Example 5
                              This example gives master nodes the highest priority:
                               MACHPRIO : (MasterMachPriority * 10000)
                            v Example 6
                               This example gives the highest priority to nodes with the highest
                               percentage of switch adapters with connectivity:
                               MACHPRIO : Connectivity
                            For more information related to using this keyword, see “Setting negotiator
                            characteristics and policies” on page 45.
                        MAIL
                          Name of a local mail program used to override default mail notification.
                            Syntax:
                            MAIL = program name
                            Default value: No default value is set.

For more information related to using this keyword, see “Using your own mail
   program” on page 81.
MASTER
  Location of the master executable (LoadL_master).
   Syntax:
   MASTER = directory
   Default value: $(BIN)/LoadL_master
   For more information related to using this keyword, see “How LoadLeveler
   daemons process jobs” on page 8.
MASTER_COREDUMP_DIR
  Local directory for storing LoadL_master core dump files.
   Syntax:
   MASTER_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
MASTER_DGRAM_PORT
  The port number used when connecting to the daemon.
   Syntax:
   MASTER_DGRAM_PORT = port number
   Default value: The default is 9617.
   For more information related to using this keyword, see “Defining network
   characteristics” on page 47.
MASTER_STREAM_PORT
  Specifies the port number to be used when connecting to the daemon.
   Syntax:
   MASTER_STREAM_PORT = port number
   Default value: The default is 9616.
   For more information related to using this keyword, see “Defining network
   characteristics” on page 47.
MAX_CKPT_INTERVAL
  The maximum number of seconds between checkpoints for running jobs.
   Syntax:
   MAX_CKPT_INTERVAL = number
   Default value: 7200 (2 hours)
   For more information related to using this keyword, see “LoadLeveler support
   for checkpointing jobs” on page 139.
MAX_JOB_REJECT
  Determines the number of times a job is rejected before it is canceled or put in
  User Hold or System Hold status.
   Syntax:
   MAX_JOB_REJECT = number




number must be a numerical value and cannot be an arithmetic expression.
                                 MAX_JOB_REJECT may be set to unlimited rejects by specifying a value of -1.
                                Default value: The default value is 0, which indicates a rejected job will
                                immediately be canceled or placed on hold.
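                                 Example: As an illustration, to allow a job to be rejected up to five times
                                 before it is canceled or placed on hold, specify:
                                 MAX_JOB_REJECT = 5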
                                For related information, see the NEGOTIATOR_REJECT_DEFER keyword.
                            MAX_RESERVATIONS
                              Specifies the maximum number of reservations that this LoadLeveler cluster
                              can have. Only reservations in waiting and in use are counted toward this
                              limit; LoadLeveler does not count reservations that have already ended or are
                              in the process of being canceled.

                                Notes:
                                          1. Having too many reservations in a LoadLeveler cluster can have
                                             performance impacts. Administrators should select a suitable value
                                             for this keyword.
|                                         2. A recurring reservation only counts as one reservation towards the
|                                            MAX_RESERVATIONS limit regardless of the number of times that
|                                            the reservation recurs.
                                Syntax:
                                MAX_RESERVATIONS = number
                                The value for this keyword can be 0 or a positive integer.
                                Default value: The default is 10.
                            MAX_STARTERS
                              Specifies the maximum number of tasks that can run simultaneously on a
                              machine. In this case, a task can be a serial job step or a parallel task.
                              MAX_STARTERS defines the number of initiators on the machine (the number
                              of tasks that can be initiated from a startd).
                                Syntax:
                                MAX_STARTERS = number
                                Default value: If this keyword is not specified, the default is the number of
                                elements in the Class statement.
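                                 Example: As an illustration, to limit a machine to two simultaneous tasks
                                 regardless of its Class definitions, you could specify the following in that
                                 machine's local configuration file:
                                 MAX_STARTERS = 2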
                                For more information related to using this keyword, see “Specifying how many
                                jobs a machine can run” on page 55.
                            MAX_TOP_DOGS
                              Specifies the maximum total number of top dogs that the central manager
                              daemon will allocate. When scheduling jobs, after MAX_TOP_DOGS total top
                              dogs have been allocated, no more will be considered.
                                Syntax:
                                MAX_TOP_DOGS = k | 1

                                where: k is a non-negative integer specifying the global maximum top dogs
                                limit.
                                Default value: The default value is 1.
                                For more information related to using this keyword, see “Using the BACKFILL
                                scheduler” on page 110.
                            MIN_CKPT_INTERVAL
                              The minimum number of seconds between checkpoints for running jobs.


Syntax:
   MIN_CKPT_INTERVAL = number
   Default value: 900 (15 minutes)
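    Example: As an illustration, the following pair of settings keeps the time
    between checkpoints for running jobs between 10 minutes and one hour:
    MIN_CKPT_INTERVAL = 600
    MAX_CKPT_INTERVAL = 3600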
   For more information related to using this keyword, see “LoadLeveler support
   for checkpointing jobs” on page 139.
NEGOTIATOR
  Location of the negotiator executable (LoadL_negotiator).
   Syntax:
   NEGOTIATOR = directory
   Default value: $(BIN)/LoadL_negotiator
   For more information related to using this keyword, see “How LoadLeveler
   daemons process jobs” on page 8.
NEGOTIATOR_COREDUMP_DIR
  Local directory for storing LoadL_negotiator core dump files.
   Syntax:
   NEGOTIATOR_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
NEGOTIATOR_CYCLE_DELAY
  Specifies the minimum time, in seconds, the negotiator delays between periods
  when it attempts to schedule jobs. This time is used by the negotiator daemon
  to respond to queries, reorder job queues, collect information about changes in
  the states of jobs, and so on. Delaying the scheduling of jobs might improve
  the overall performance of the negotiator by preventing it from spending
  excessive time attempting to schedule jobs.
   Syntax:
   NEGOTIATOR_CYCLE_DELAY = number

   number must be a numerical value and cannot be an arithmetic expression.
    Default value: The default is 0 seconds.
NEGOTIATOR_CYCLE_TIME_LIMIT
  Specifies the maximum amount of time, in seconds, that LoadLeveler will
  allow the negotiator to spend in one cycle trying to schedule jobs. The
  negotiator cycle will end, after the specified number of seconds, even if there
  are additional jobs waiting for dispatch. Jobs waiting for dispatch will be
  considered at the next negotiator cycle. The
  NEGOTIATOR_CYCLE_TIME_LIMIT keyword applies only to the BACKFILL
  scheduler.
   Syntax:
   NEGOTIATOR_CYCLE_TIME_LIMIT = number

   Where number must be a positive integer or zero and cannot be an arithmetic
   expression.
   Default value: If the keyword value is not specified or a value of zero is used,
   the negotiator cycle will be unlimited.


NEGOTIATOR_INTERVAL
                          The time interval, in seconds, at which the negotiator daemon updates the
                          status of jobs in the LoadLeveler cluster and negotiates with machines that are
                          available to run jobs.
                            Syntax:
                            NEGOTIATOR_INTERVAL = number

                            Where number specifies the interval, in seconds, at which the negotiator
                            daemon performs a “negotiation loop” during which it attempts to assign
                            available machines to waiting jobs. A negotiation loop also occurs whenever
                            job states or machine states change. number must be a numerical value and
                            cannot be an arithmetic expression.
                             When this keyword is set to zero, the central manager’s automatic scheduling
                             activity is disabled, and LoadLeveler will not attempt to schedule any
                            jobs unless instructed to do so through the llrunscheduler command or
                            ll_run_scheduler subroutine.
                            Default value: The default is 30 seconds.
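                             Example: As an illustration, to disable automatic scheduling so that jobs
                             are scheduled only through the llrunscheduler command or the
                             ll_run_scheduler subroutine, specify:
                             NEGOTIATOR_INTERVAL = 0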
                            For more information related to using this keyword, see “Controlling the
                            central manager scheduling cycle” on page 73.
                        NEGOTIATOR_LOADAVG_INCREMENT
                          Specifies the value the negotiator adds to the startd machine’s load average
                          whenever a job in the Pending state is queued on that machine. This value is
                          used to compensate for the increased load caused by starting another job.
                            Syntax:
                            NEGOTIATOR_LOADAVG_INCREMENT = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default value is .5
                        NEGOTIATOR_PARALLEL_DEFER
                          Specifies the amount of time, in seconds, that defines how long a job stays out
                          of the queue after it fails to get the correct number of processors. This keyword
                          applies only to the default LoadLeveler scheduler. This keyword must be
                           greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is
                          used.
                            Syntax:
                            NEGOTIATOR_PARALLEL_DEFER = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.
                        NEGOTIATOR_PARALLEL_HOLD
                          Specifies the amount of time, in seconds, that defines how long a job is given
                          to accumulate processors. This keyword applies only to the default
                          LoadLeveler scheduler. This keyword must be greater than the
                          NEGOTIATOR_INTERVAL value; if it is not, the default is used.
                            Syntax:
                            NEGOTIATOR_PARALLEL_HOLD = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
  Specifies the amount of time, in seconds, between calculation of the SYSPRIO
  values for waiting jobs. Recalculating the priority can be CPU-intensive;
  specifying low values for the
  NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword may lead to
  a heavy CPU load on the negotiator if a large number of jobs are running or
  waiting for resources. A value of 0 means the SYSPRIO values are not
  recalculated.
   You can use this keyword to base the order in which jobs are run on the
   current number of running, queued, or total jobs for a user or a group.
   Syntax:
   NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 120 seconds.
NEGOTIATOR_REJECT_DEFER
  Specifies the amount of time in seconds the negotiator waits before it considers
  scheduling a job to a machine that recently rejected the job.
   Syntax:
   NEGOTIATOR_REJECT_DEFER = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 120 seconds.
   For related information, see the MAX_JOB_REJECT keyword.
NEGOTIATOR_REMOVE_COMPLETED
  Specifies the amount of time, in seconds, that you want the negotiator to keep
  information regarding completed and removed jobs so that you can query this
  information using the llq command.
   Syntax:
   NEGOTIATOR_REMOVE_COMPLETED = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 0 seconds.
NEGOTIATOR_RESCAN_QUEUE
   Specifies the amount of time, in seconds, that the negotiator waits before
   rescanning the job queue for bypassed jobs that could not run on certain
   machines because of conditions that may change over time. This keyword
   must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the
   default is used.
   Syntax:
   NEGOTIATOR_RESCAN_QUEUE = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 900 seconds.
NEGOTIATOR_STREAM_PORT
  Specifies the port number used when connecting to the daemon.
   Syntax:
   NEGOTIATOR_STREAM_PORT = port number

Default value: The default is 9614.
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        OBITUARY_LOG_LENGTH
                            Specifies the number of lines from the end of a daemon’s log file that are
                            appended to the mail message. The master daemon mails this log excerpt to
                            the LoadLeveler administrators when one of the daemons dies.
                            Syntax:
                            OBITUARY_LOG_LENGTH = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 25.
                        POLLING_FREQUENCY
                           Specifies the interval, in seconds, with which the startd daemon evaluates the
                           load on the local machine and decides whether to suspend, resume, or abort
                           jobs. This time is also the minimum interval at which the kbdd daemon reports
                           keyboard or mouse activity to the startd daemon.
                            Syntax:
                            POLLING_FREQUENCY = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 5.
                        POLLS_PER_UPDATE
                            Specifies how often, in POLLING_FREQUENCY intervals, the startd daemon
                           updates the central manager. Due to the communication overhead, it is
                           impractical to do this with the frequency defined by the
                           POLLING_FREQUENCY keyword. Therefore, the startd daemon only updates
                           the central manager every nth (where n is the number specified for
                           POLLS_PER_UPDATE) local update. Change POLLS_PER_UPDATE when
                           changing the POLLING_FREQUENCY.
                            Syntax:
                            POLLS_PER_UPDATE = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 24.
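                             Example: The effective update interval is POLLING_FREQUENCY
                             multiplied by POLLS_PER_UPDATE, so the defaults update the central
                             manager every 5 * 24 = 120 seconds. As an illustration, to update the
                             central manager every 60 seconds while still polling every 5 seconds,
                             specify:
                             POLLING_FREQUENCY = 5
                             POLLS_PER_UPDATE = 12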
                        PRESTARTED_STARTERS
                           Specifies how many prestarted starter processes LoadLeveler will maintain on
                           an execution node to manage jobs when they arrive. The startd daemon starts
                           the number of starter processes specified by this keyword. You may specify
                           this keyword in either the global or local configuration file.
                            Syntax:
                            PRESTARTED_STARTERS = number

                            number must be less than or equal to the value specified through the
                            MAX_STARTERS keyword. If the value of PRESTARTED_STARTERS specified
                            is greater than MAX_STARTERS, LoadLeveler records a warning message in
                            the startd log and assigns PRESTARTED_STARTERS the same value as
                            MAX_STARTERS.



If the value of PRESTARTED_STARTERS is zero, no starter processes will be
   started before jobs arrive on the execution node.
   Default value: The default is 1.
PREEMPT_CLASS
   Defines the preemption rule for a job class.
   Syntax: The following forms illustrate correct syntax.
   PREEMPT_CLASS[incoming_class] = ALL[:preempt_method] { outgoing_class1
   [outgoing_class2 ...] }
          Using this form, ALL indicates that job steps of incoming_class have
          priority and will not share nodes with job steps of outgoing_class1,
          outgoing_class2, or other outgoing classes. If a job step of the
          incoming_class is to be started on a set of nodes, all job steps of
          outgoing_class1, outgoing_class2, or other outgoing classes running on
          those nodes will be preempted.

           Note: The ALL preemption rule does not apply to Blue Gene jobs.
   PREEMPT_CLASS[incoming_class] = ENOUGH[:preempt_method] {
   outgoing_class1 [outgoing_class2 ...] }
          Using this form, ENOUGH indicates that job steps of incoming_class
          will share nodes with job steps of outgoing_class1, outgoing_class2, or
          other outgoing classes if there are sufficient resources. If a job step of
          the incoming_class is to be started on a set of nodes, one or more job
          steps of outgoing_class1, outgoing_class2, or other outgoing classes
          running on those nodes may be preempted to get needed resources.

   Combinations of these forms are also allowed.

   Note:
           1. The optional specification preempt_method indicates which method
              LoadLeveler is to use to preempt the jobs; this specification is valid
              only for the BACKFILL scheduler. Valid values for this specification
              in keyword syntax are the highlighted abbreviations in parentheses:
              v Remove (rm)
              v System hold (sh)
              v Suspend (su)
              v Vacate (vc)
              v User hold (uh)
                For more information about preemption methods, see “Steps for
                configuring a scheduler to preempt jobs” on page 130.
            2.   Using the "ALL" value in the PREEMPT_CLASS keyword places
                implied restrictions on when a job can start. See “Planning to
                preempt jobs” on page 128 for more information.
           3.   The incoming class is designated inside [ ] brackets.
           4.   Outgoing classes are designated inside { } curly braces.
            5.   The job classes on the right-hand (outgoing) side of the statement
                 must be different from the incoming class, or may be allclasses. If
                 the outgoing side is defined as allclasses, then all job classes are
                 preemptable with the exception of the incoming class specified
                 within brackets.




                                       6. A class name or allclasses should not be in both the ALL list and
                                          the ENOUGH list. If one is, the entire statement will be
                                          ignored. An example of this is:
                                         PREEMPT_CLASS[Class_A]=ALL{allclasses} ENOUGH {allclasses}
                                      7. If you use allclasses as an outgoing (preemptable) class, then no
                                         other class names should be listed at the right hand side as the
                                         entire statement will be ignored. An example of this is:
                                         PREEMPT_CLASS[Class_A]=ALL{Class_B} ENOUGH {allclasses}
                                      8. More than one ALL statement and more than one ENOUGH
                                         statement may appear at the right hand side. Multiple statements
                                         have a cumulative effect.
                                      9. Each ALL or ENOUGH statement can have multiple class names
                                         inside the curly braces. However, a blank space delimiter is
                                         required between each class name.
                                    10. Both the ALL and ENOUGH statements can include an optional
                                        specification indicating the method LoadLeveler will use to
                                        preempt the jobs. Valid values for this specification are listed in the
                                        description of the DEFAULT_PREEMPT_METHOD keyword. If a
                                        value is specified on the PREEMPT_CLASS ALL or ENOUGH
                                        statement, that value overrides the value set on the
                                        DEFAULT_PREEMPT_METHOD keyword, if any.
                                    11. ALL and ENOUGH may be in mixed cases.
                                    12. Spaces are allowed around the brackets and curly braces.
                                    13. PREEMPT_CLASS [allclasses] will be ignored.
                            Default value: No default value is set.
                            Examples:
                            PREEMPT_CLASS[Class_B]=ALL{Class_E Class_D} ENOUGH {Class_C}
                                   This indicates that all Class_E jobs and all Class_D jobs and enough
                                   Class_C jobs will be preempted to enable an incoming Class_B job to
                                   run.
                            PREEMPT_CLASS[Class_D]=ENOUGH:VC {Class_E}
                                   This indicates that zero, one, or more Class_E jobs will be preempted
                                   using the vacate method to enable an incoming Class_D job to run.
                        PREEMPTION_SUPPORT
                           For the BACKFILL or API schedulers only, specifies the level of preemption
                           support for a cluster.
                            Syntax:
                            PREEMPTION_SUPPORT= full | no_adapter | none
                            v When set to full, preemption is fully supported.
                            v When set to no_adapter, preemption is supported but the adapter resources
                              are not released by preemption.
                            v When set to none, preemption is not supported, and preemption requests
                              will be rejected.

                            Note:
                                    1. If the value of this keyword is set to any value other than none for
                                       the default scheduler, LoadLeveler will not start.




2. For the BACKFILL or API scheduler, when this keyword is set to full
             or no_adapter and preemption by the suspend method is required,
             the configuration keyword PROCESS_TRACKING must be set to
             true.
   Default value: The default value for all schedulers is none; if you want to
   enable preemption under these schedulers, you must set a value for this
   keyword.
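    Example: As an illustration, to enable full preemption support with the
    BACKFILL scheduler and allow preemption by the suspend method, you
    could specify:
    SCHEDULER_TYPE = BACKFILL
    PREEMPTION_SUPPORT = full
    PROCESS_TRACKING = TRUE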
PROCESS_TRACKING
    Specifies whether LoadLeveler will cancel any processes (throughout the
    entire cluster) left behind when a job terminates.
   Syntax:
   PROCESS_TRACKING = TRUE | FALSE

    When set to TRUE, this keyword ensures that when a job is terminated, no
    processes created by the job will continue running.

   Note: It is necessary to set this keyword to true to do preemption by the
         suspend method with the BACKFILL or API scheduler.
   Default value: FALSE
PROCESS_TRACKING_EXTENSION
   Specifies the directory containing the kernel module LoadL_pt_ke (AIX) or
   proctrk.ko (Linux).
   Syntax:
   PROCESS_TRACKING_EXTENSION = directory
   Default value: The directory $HOME/bin
   For more information related to using this keyword, see “Tracking job
   processes” on page 70.
PUBLISH_OBITUARIES
   Specifies whether or not the master daemon sends mail to the administrator
   when any daemon it manages ends abnormally. When set to true, this keyword
   specifies that the master daemon sends mail to the administrators identified by
   LOADL_ADMIN keyword.
   Syntax:
   PUBLISH_OBITUARIES = true | false
   Default value: true
REJECT_ON_RESTRICTED_LOGIN
   Specifies whether the user’s account status will be checked on every node
   where the job will be run by calling the AIX loginrestrictions function with the
   S_DIST_CLNT flag.
   Restriction: Login restriction checking is ignored by LoadLeveler for Linux.
   Login restriction checking includes:
   v Does the account still exist?
   v Is the account locked?
   v Has the account expired?
   v Do failed login attempts exceed the limit for this account?
   v Is login disabled via /etc/nologin?




If the AIX loginrestrictions function indicates a failure, then the user’s job will
                                 be rejected and processed according to the LoadLeveler configuration
                                 parameters MAX_JOB_REJECT and ACTION_ON_MAX_REJECT.
                                Syntax:
                                REJECT_ON_RESTRICTED_LOGIN = true | false
                                Default value: false
                            RELEASEDIR
                               Defines the directory where all the LoadLeveler software resides.
                                Syntax:
                                RELEASEDIR = release directory
                                Default value: $(RELEASEDIR)
                            RESERVATION_CAN_BE_EXCEEDED
                               Specifies whether LoadLeveler will schedule job steps that are bound to a
                               reservation when their end times (based on hard wall-clock limits) exceed the
                               reservation end time.
                                Syntax:
                                RESERVATION_CAN_BE_EXCEEDED = true | false
                                When this keyword is set to false, LoadLeveler schedules only those job steps
                                that will complete before the reservation ends. When set to true, LoadLeveler
                                schedules job steps to run under a reservation even if their end times are
                                expected to exceed the reservation end time. When the reservation ends,
                                however, the reserved nodes no longer belong to the reservation, and so these
                                nodes might not be available for the jobs to continue running. In this case,
                                LoadLeveler might preempt the running jobs.
                                Note that this keyword setting does not change the actual end time of the
                                reservation. It only affects how LoadLeveler manages job steps whose end
                                times exceed the end time of the reservation.
                                Default value: true
                            RESERVATION_HISTORY
                               Defines the name of a file that is to contain the local history of reservations.
                                Syntax:
                                RESERVATION_HISTORY = file name
|                               LoadLeveler appends a single line to the reservation history file for each
|                               completed occurrence of each reservation. For an example, see “Collecting
|                               accounting data for reservations” on page 63.
                                Default value: $(SPOOL)/reservation_history
                            RESERVATION_MIN_ADVANCE_TIME
                               Specifies the minimum time, in minutes, between the time at which a
                               reservation is created and the time at which the reservation is to start.
                                Syntax:
                                RESERVATION_MIN_ADVANCE_TIME = number of minutes

                                By default, the earliest time at which a reservation may start is the current time
                                plus the value set for the RESERVATION_SETUP_TIME keyword.
                                Default value: 0 (zero)



RESERVATION_PRIORITY
   Specifies whether LoadLeveler administrators may reserve nodes on which
   running jobs are expected to end after the reservation start time. This keyword
   value applies only for LoadLeveler administrators; other reservation owners do
   not have this capability.
   Syntax:
   RESERVATION_PRIORITY = NONE | HIGH
   When you set this keyword to HIGH, before activating the reservation,
   LoadLeveler preempts the job steps running on the reserved nodes (Blue Gene
   job steps are handled the same way). The only exceptions are non-preemptable
   jobs; LoadLeveler will not preempt those jobs because of any reservations.
   Default value: NONE
RESERVATION_SETUP_TIME
   Specifies how much time, in seconds, that LoadLeveler may use to prepare for
   a reservation before it is to start. The tasks that LoadLeveler performs during
   this time include checking and reporting node conditions, and preempting job
   steps still running on the reserved nodes.
   For a given reservation, LoadLeveler uses the RESERVATION_SETUP_TIME
   keyword value that is set at the time that the reservation is created, not
   whatever value might be set when the reservation starts. If the start time of the
   reservation is modified, however, LoadLeveler uses the
   RESERVATION_SETUP_TIME keyword value that is set at the time of the
   modification.
   Syntax:
   RESERVATION_SETUP_TIME = number of seconds
   Default value: 60
RESTARTS_PER_HOUR
   Specifies how many times the master daemon attempts to restart a daemon
   that dies abnormally. Because one or more of the daemons may be unable to
   run due to a permanent error, the master only attempts
    $(RESTARTS_PER_HOUR) restarts within a 60-minute period. Failing that, it
   sends mail to the administrators identified by the LOADL_ADMIN keyword
   and exits.
   Syntax:
   RESTARTS_PER_HOUR = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 12.
RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
    Specifies whether the startd that was drained when the switch table failed
    to unload will automatically resume once the unload errors are cleared.
   The unload error is considered cleared after LoadLeveler can successfully
   unload the switch table. For this keyword to work, the
   DRAIN_ON_SWITCH_TABLE_ERROR option in the configuration file must
   be turned on and not disabled. Flushing, suspending, or draining of a startd
   manually or automatically will disable this option until the startd is manually
   resumed.
   Syntax:
   RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true | false


Default value: false
                            RSET_SUPPORT
                               Indicates the level of RSet support present on a machine.
                                Syntax:
                                RSET_SUPPORT = option

                                The available options are:
                                RSET_MCM_AFFINITY
                                     Indicates that the machine can run jobs requesting MCM (memory or
                                     adapter) and processor (cache or SMT) affinity.
                                RSET_NONE
                                      Indicates that LoadLeveler RSet support is not available on the
                                      machine.
                                RSET_USER_DEFINED
                                      Indicates that the machine can be used for jobs with a user-created
                                      RSet in their job command file.
                                Default value: RSET_NONE
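                                 Example: As an illustration, to indicate that a machine can run jobs
                                 requesting MCM and processor affinity, specify:
                                 RSET_SUPPORT = RSET_MCM_AFFINITY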
                            SAVELOGS
                               Specifies the directory in which log files are archived.
                                Syntax:
                                SAVELOGS = directory

                                Where directory is the directory in which log files will be archived.
                                Default value: No default value is set.
                                For more information related to using this keyword, see “Configuring
                                recording activity and log files” on page 48.
                            SAVELOGS_COMPRESS_PROGRAM
                               Compresses logs after they are copied to the SAVELOGS directory. If not
                               specified, SAVELOGS are copied, but are not compressed.
                                Syntax:
                                SAVELOGS_COMPRESS_PROGRAM = program

                                Where program is a specific executable program. It can be a system-provided
                                 facility (such as /bin/gzip) or an administrator-provided executable program.
                                The value must be a full path name and can contain command-line arguments.
                                LoadLeveler will call the program as: program filename.
                                Default value: If blank, the logs are not compressed.
                                Example: In this example, LoadLeveler will run the gzip -f command. The log
                                file in SAVELOGS will be compressed after it is copied to SAVELOGS. If the
                                program cannot be found or is not executable, LoadLeveler will log the error
                                and SAVELOGS will remain uncompressed.
                                SAVELOGS_COMPRESS_PROGRAM = /bin/gzip -f
|                           SCALE_ACROSS_SCHEDULING_TIMEOUT
|                              Defines the amount of time a central manager will wait:
|                              v For the main cluster central manager, this value defines the wait time for
|                                responses from the non-main cluster central managers when it is scheduling
|                                scale-across jobs.


|      v For the non-main cluster central managers, this value limits how long the
|        central manager on each non-main cluster will hold resources for a
|        scale-across job step while waiting for an order to start the job.
|      Syntax:
|      SCALE_ACROSS_SCHEDULING_TIMEOUT = number
|      Default value: 300 seconds
    SCHEDD
       Location of the Schedd executable (LoadL_schedd).
       Syntax:
       SCHEDD = directory
       Default value: $(BIN)/LoadL_schedd
       For more information related to using this keyword, see “How LoadLeveler
       daemons process jobs” on page 8.
    SCHEDD_COREDUMP_DIR
       Specifies the local directory for storing LoadL_schedd core dump files.
       Syntax:
       SCHEDD_COREDUMP_DIR = directory
       Default value: The /tmp directory.
       For more information related to using this keyword, see “Specifying file and
       directory locations” on page 47.
    SCHEDD_INTERVAL
       Specifies the interval, in seconds, at which the Schedd daemon checks the local
       job queue and updates the negotiator daemon.
       Syntax:
       SCHEDD_INTERVAL = number

       number must be a numerical value and cannot be an arithmetic expression.
       Default value: The default is 60 seconds.
    SCHEDD_RUNS_HERE
       Specifies whether the Schedd daemon runs on the host. If you do not want to
       run the Schedd daemon, specify false.
       This keyword does not designate a machine as a public scheduling machine.
       Unless configured as a public scheduling machine, a machine configured to
       run the Schedd daemon will only accept job submissions from the same
       machine running the Schedd daemon. A public scheduling machine accepts job
       submissions from other machines in the LoadLeveler cluster. To configure a
       machine as a public scheduling machine, see the schedd_host keyword
       description in “Administration file keyword descriptions” on page 327.
       Syntax:
       SCHEDD_RUNS_HERE = true | false
       Default value: true
    SCHEDD_SUBMIT_AFFINITY
       Specifies whether job submissions are directed to a locally running Schedd
       daemon. When the keyword is set to true, job submissions are directed to a
       Schedd daemon running on the same machine where the submission takes
       place, provided there is a Schedd daemon running on that machine. In this

case the submission is said to have "affinity" for the local Schedd daemon. If
                                there is no Schedd daemon running on the machine where the submission
                                takes place, or if this keyword is set to false, the job submission will only be
                                directed to a Schedd daemon serving as a public scheduling machine. In this
                                case, if there are no public scheduling machines configured the job cannot be
                                submitted. A public scheduling machine accepts job submissions from other
                                machines in the LoadLeveler cluster. To configure a machine as a public
                                scheduling machine, see the schedd_host keyword description in
                                “Administration file keyword descriptions” on page 327.
                                Installations with a large number of nodes should consider setting this
                                keyword to false to more evenly distribute dispatching of jobs among the
                                Schedd daemons. For more information, see “Scaling considerations” on page
                                719.
                                Syntax:
                                SCHEDD_SUBMIT_AFFINITY = true | false
                                Default value: true
                            SCHEDD_STATUS_PORT
                               Specifies the port number used when connecting to the daemon.
                                Syntax:
                                SCHEDD_STATUS_PORT = port number
                                Default value: The default is 9606.
                                For more information related to using this keyword, see “Defining network
                                characteristics” on page 47.
                            SCHEDD_STREAM_PORT
                               Specifies the port number used when connecting to the daemon.
                                Syntax:
                                SCHEDD_STREAM_PORT = port number
                                Default value: The default is 9605.
                                For more information related to using this keyword, see “Defining network
                                characteristics” on page 47.
                            SCHEDULE_BY_RESOURCES
                               Specifies which consumable resources are considered by the LoadLeveler
                               schedulers. Each consumable resource name may be an administrator-defined
                               alphanumeric string, or may be one of the following predefined resources:
                               v ConsumableCpus
                               v ConsumableMemory
                               v ConsumableVirtualMemory
|                              v ConsumableLargePageMemory
                               v RDMA
                                Each string may only appear in the list once. These resources are either floating
                                resources, or machine resources. If any resource is specified incorrectly with
                                the SCHEDULE_BY_RESOURCES keyword, then all scheduling resources will
                                be ignored.

                                Syntax:
                                SCHEDULE_BY_RESOURCES = name name ... name
                                Default value: No default value is set.
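                                 Example: As an illustration, to have the schedulers consider CPUs, real
                                 memory, and an administrator-defined floating resource (here given the
                                 hypothetical name licenses), specify:
                                 SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory licenses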



SCHEDULER_TYPE
   Specifies the LoadLeveler scheduling algorithm:
   LL_DEFAULT
         Specifies the default LoadLeveler scheduling algorithm. If
         SCHEDULER_TYPE has not been defined, LoadLeveler will use the
         default scheduler (LL_DEFAULT).
   BACKFILL
        Specifies the LoadLeveler BACKFILL scheduler. When you specify this
        keyword, you should use only the default settings for the START
        expression and the other job control expressions described in
        “Managing job status through control expressions” on page 68.
   API       Specifies that you will use an external scheduler. External schedulers
             communicate to LoadLeveler through the job control API. For more
             information on setting an external scheduler, see “Using an external
             scheduler” on page 115.
   Syntax:
   SCHEDULER_TYPE = LL_DEFAULT | BACKFILL | API
   Default value: LL_DEFAULT

   Note:
           1. If a scheduler type is not set, LoadLeveler will start, but it will use
              the default scheduler.
           2. If you have set SCHEDULER_TYPE with an option that is not valid,
              LoadLeveler will not start.
           3. If you change the scheduler option specified by
              SCHEDULER_TYPE, you must stop and restart LoadLeveler using
              llctl or recycle using llctl.
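    Example: To select the BACKFILL scheduler, specify:
    SCHEDULER_TYPE = BACKFILL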
   For more information related to using this keyword, see “Defining a
   LoadLeveler cluster” on page 44.
SEC_ADMIN_GROUP
   When security services are enabled, this keyword points to the name of the
   UNIX group that contains the local identities of the LoadLeveler
   administrators.
   Restriction: CtSec security is not supported on LoadLeveler for Linux.
   Syntax:
   SEC_ADMIN_GROUP = name of lladmin group
   Default value: No default value is set.
   For more information related to using this keyword, see “Configuring
   LoadLeveler to use cluster security services” on page 57.
SEC_ENABLEMENT
   Specifies the security mechanism to be used.
   Restriction: Do not set this keyword to CtSec in the configuration file for a
   Linux machine. CtSec security is not supported on LoadLeveler for Linux.
   Syntax:
   SEC_ENABLEMENT = COMPAT | CTSEC
   Default value: No default value is set.


SEC_SERVICES_GROUP
                           When security services are enabled, this keyword specifies the name of the
                           LoadLeveler services group.
                            Restriction: CtSec security is not supported on LoadLeveler for Linux.
                            Syntax:
                            SEC_SERVICES_GROUP=group name

                            Where group name defines the identities of the LoadLeveler daemons.
                            Default value: No default value is set.
                        SEC_IMPOSED_MECHS
                           Specifies a blank-delimited list of LoadLeveler’s permitted security mechanisms
                           when Cluster Security (CtSec) services are enabled.
                            Restriction: CtSec security is not supported on LoadLeveler for Linux.
                            Syntax: Specify a blank delimited list containing combinations of the following
                            values:
                            none      If this is the only value specified, then users will run unauthenticated
                                      and, if authorization is necessary, the job will fail. If this is not the only
                                      value specified, then users may run unauthenticated and, if
                                      authorization is necessary, the job will fail.
                            unix      If this is the only value specified, then UNIX host-based authentication
                                      will be used; otherwise, other mechanisms may be used.
                            Default value: No default value is set.
                            Example:
                            SEC_IMPOSED_MECHS = none unix
                        SPOOL
                            Defines the local directory where LoadLeveler keeps the local job queue and
                            checkpoint files.
                            Syntax:
                            SPOOL = local directory/spool
                            Default value: $(tilde)/spool
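                             Example: A typical setting that places the spool directory on a local (not
                             shared) file system; the path shown is only illustrative:
                             SPOOL = /var/loadl/spool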
                        START
                           Determines whether a machine can run a LoadLeveler job.
                            Syntax:
                            START: expression that evaluates to T or F (true or false)

                            When the expression evaluates to T, LoadLeveler considers dispatching a job
                            to the machine. When you use a START expression that is based on the CPU
                            load average, the negotiator may evaluate the expression as F even though the
                            load average indicates the machine is Idle. This is because the negotiator adds
                            a compensating factor to the startd machine’s load average every time the
                            negotiator assigns a job. For more information, see the
                            NEGOTIATOR_INTERVAL keyword.
                            Default value: No default value is set, which means that no jobs will be
                            started.
                            For information about time-related variables that you may use for this
                            keyword, see “Variables to use for setting times” on page 320.
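                             Example: A minimal sketch that considers starting jobs outside of business
                             hours, or at any time when the load average is low; the hours and the
                             threshold are only illustrative:
                             START : (tm_hour < 8) || (tm_hour >= 18) || (LoadAvg <= 0.5)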


START_CLASS
   Specifies the rule for starting a job of the incoming_class. The START_CLASS
   rule is applied whenever the BACKFILL scheduler decides whether a job step
   of the incoming_class should start or not.
   Syntax:
   START_CLASS[incoming_class] = (start_class_expression) [ && (start_class_expression) ...]

   Where start_class_expression takes the form:
   run_class < number_of_tasks
          Which indicates that a job step of the incoming_class is only allowed to
          run on a node when the number of tasks of run_class running on that
          node is less than number_of_tasks.

   Note:
           1. START_CLASS [allclasses] will be ignored.
           2. The job class specified by run_class may be the same as or different
              from the class specified by incoming_class.
           3. You can also define run_class as allclasses. If you do, the total
              number of all job tasks running on that node cannot exceed the
              value specified by number_of_tasks.
            4. A class name or allclasses should not appear twice on the right-hand
               side of the keyword statement. However, you can use other class
               names with allclasses on the right-hand side of the statement.
           5. If there is more than one start_class_expression, you must use &&
              between adjacent start_class_expressions.
           6. Both the START keyword and the START_CLASS keyword have to
              be true before a new job can start.
            7. Parentheses ( ) are optional around start_class_expression.
   For information related to using this keyword, see “Planning to preempt jobs”
   on page 128.
   Default value: No default value is set.
   Examples:
   START_CLASS[Class_A] = (Class_A < 1)
         This statement indicates that a Class_A job can only start on nodes that
         do not have any Class_A jobs running.
   START_CLASS[Class_B] = allclasses < 5
          This statement indicates that a Class_B job can only start on nodes
          that are running fewer than five tasks of any class.
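    The following sketch combines two start_class_expressions with && (see
    note 5); the class names and limits are only illustrative:
    START_CLASS[Class_C] = (Class_A < 2) && (allclasses < 6)
          This statement indicates that a Class_C job can only start on nodes
          that are running fewer than two Class_A tasks and fewer than six
          tasks in total.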
START_DAEMONS
   Specifies whether to start the LoadLeveler daemons on the node.
   Syntax:
   START_DAEMONS = true | false
   Default value: true
   When true, the daemons are started. In most cases, you will probably want to
   set this keyword to true. An example of why this keyword would be set to
   false is if you want to run the daemons on most of the machines in the cluster
   but some individual users with their own local configuration files do not want
    their machines to run the daemons. The individual users would modify their
    local configuration files and set this keyword to false. Because the global
    configuration file has the keyword set to true, their individual machines would
    still be able to participate in the LoadLeveler cluster.
                            Also, to define the machine as strictly a submit-only machine, set this keyword
                            to false.
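    Example: A minimal local configuration file entry for a submit-only
    machine:
    START_DAEMONS = false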
                        STARTD
                           Location of the startd executable (LoadL_startd).
                            Syntax:
                            STARTD = directory
                            Default value: $(BIN)/LoadL_startd
                            For more information related to using this keyword, see “How LoadLeveler
                            daemons process jobs” on page 8.
                        STARTD_COREDUMP_DIR
                           Local directory for storing LoadL_startd core dump files.
                            Syntax:
                            STARTD_COREDUMP_DIR = directory
                            Default value: The /tmp directory.
                            For more information related to using this keyword, see “Specifying file and
                            directory locations” on page 47.
                        STARTD_DGRAM_PORT
   Specifies the datagram port number used when connecting to the startd daemon.
                            Syntax:
                            STARTD_DGRAM_PORT = port number
                             Default value: 9615
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        STARTD_RUNS_HERE
                           Specifies whether the startd daemon runs on the host. If you do not want to
                           run the startd daemon, specify false.
                            Syntax:
                            STARTD_RUNS_HERE = true | false
                            Default value: true
                        STARTD_STREAM_PORT
   Specifies the stream port number used when connecting to the startd daemon.
                            Syntax:
                            STARTD_STREAM_PORT = port number
                             Default value: 9611
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        STARTER
                           Location of the starter executable (LoadL_starter).
                            Syntax:
                            STARTER = directory


Default value: $(BIN)/LoadL_starter
    For more information related to using this keyword, see “How LoadLeveler
    daemons process jobs” on page 8.
STARTER_COREDUMP_DIR
   Local directory for storing LoadL_starter core dump files.
    Syntax:
    STARTER_COREDUMP_DIR = directory
    Default value: The /tmp directory.
    For more information related to using this keyword, see “Specifying file and
    directory locations” on page 47.
SUBMIT_FILTER
   Specifies the program you want to run to filter a job script when the job is
   submitted.
    Syntax:
    SUBMIT_FILTER = full_path_to_executable

    Where full_path_to_executable is called with the job command file as the
    standard input. The standard output is submitted to LoadLeveler. If the
    program returns with a nonzero exit code, the job submission is canceled. A
    submit filter can only make changes to LoadLeveler job command file keyword
    statements.
    Default value: No default value is set.
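     Example: A minimal sketch of a filter program; the path and the policy of
     forcing a default class are only illustrative:
     SUBMIT_FILTER = /usr/local/loadl/submit_filter

     where /usr/local/loadl/submit_filter contains:
     #!/bin/sh
     # Read the job command file from standard input; if it contains no
     # class keyword, insert a default class statement before the first
     # queue statement. The unchanged or modified file is written to
     # standard output, which LoadLeveler then submits.
     awk '
       /^#[ \t]*@[ \t]*class/ { seen = 1 }
       /^#[ \t]*@[ \t]*queue/ { if (!seen) { print "# @ class = normal"; seen = 1 } }
       { print }
     '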
     Multicluster use: In a multicluster environment, if you specified a valid cluster
     list with either the llsubmit -X option or the ll_cluster API, then the
     SUBMIT_FILTER will instead be invoked with a modified job command file
     that contains a cluster_list keyword generated from that option or API call.
    The modified job command file will contain an inserted # @ cluster_list =
    cluster statement just prior to the first # @ queue statement. This cluster_list
    statement takes precedence and overrides all previous specifications of any
    cluster_list statements from the original job command file.
    Example: SUBMIT_FILTER in a multicluster environment
    The following job command file, job.cmd, requests to be run remotely on
    cluster1:
    #!/bin/sh
    # @ cluster_list = cluster1
    # @ error = job1.$(Host).$(Cluster).$(Process).err
    # @ output = job1.$(Host).$(Cluster).$(Process).out
    # @ queue

    After issuing llsubmit -X cluster2 job.cmd, the modified job command file
    statements will be run on cluster2:
    #!/bin/sh
    # @ cluster_list = cluster1
    # @ error = job1.$(Host).$(Cluster).$(Process).err
    # @ output = job1.$(Host).$(Cluster).$(Process).out
    # @ cluster_list = cluster2
    # @ queue
    For more information related to using this keyword, see “Filtering a job script”
    on page 76.

SUSPEND
                               Determines whether running jobs should be suspended.
                                Syntax:
                                SUSPEND: expression that evaluates to T or F (true or false)

                                When T, LoadLeveler temporarily suspends jobs currently running on the
                                machine. Suspended LoadLeveler jobs will either be continued or vacated. This
                                keyword is not supported for parallel jobs.

                                Default value: No default value is set.
                                For information about time-related variables that you may use for this
                                keyword, see “Variables to use for setting times” on page 320.
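                                 Example: A minimal sketch that suspends jobs when someone is using the
                                 machine interactively, assuming the KeyboardIdle machine variable (the
                                 number of seconds since local keyboard or mouse activity); the threshold
                                 is only illustrative:
                                 SUSPEND : (KeyboardIdle < 60)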
                            SYSPRIO
                               System priority expression.
                                Syntax:
                                SYSPRIO : expression

                                You can use the following LoadLeveler variables to define the SYSPRIO
                                expression:
                                v ClassSysprio
                                v GroupQueuedJobs
                                v GroupRunningJobs
                                v GroupSysprio
                                v GroupTotalJobs
                                v GroupTotalShares
                                v GroupUsedBgShares
                                v GroupUsedShares
                                v JobIsBlueGene
                                v QDate
|                               v UserHoldTime
                                v UserPrio
                                v UserQueuedJobs
                                v UserRunningJobs
                                v UserSysprio
                                v UserTotalJobs
                                v UserTotalShares
                                v UserUsedBgShares
                                v UserUsedShares
                                For detailed descriptions of these variables, see “LoadLeveler variables” on
                                page 314.
                                Default value: 0 (zero)

                                Note:
                                        1. The SYSPRIO keyword is valid only on the machine where the
                                           central manager is running. Using this keyword in a local
                                           configuration file has no effect.
                                        2. It is recommended that you do not use UserPrio in the SYSPRIO
                                           expression, since user jobs are already ordered by UserPrio.
                                        3. The string SYSPRIO can be used as both the name of an expression
                                           (SYSPRIO: value) and the name of a variable (SYSPRIO = value).
                                           To specify the expression to be used to calculate job priority you
                                            must use the syntax for the SYSPRIO expression. If the variable is
                                            mistakenly used for the SYSPRIO expression, which requires a colon
                                            (:) after the name, the job priority value will always be 0 because the
                                            SYSPRIO expression has not been defined.
      4. When the UserRunningJobs, GroupRunningJobs, UserQueuedJobs,
         GroupQueuedJobs, UserTotalJobs, GroupTotalJobs,
         GroupTotalShares, GroupUsedShares, UserTotalShares,
         UserUsedShares, GroupUsedBgShares, JobIsBlueGene, and
         UserUsedBgShares variables are used to prioritize the queue based
         on current usage, you should also set
         NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL so that the
         priorities are adjusted according to current usage rather than usage
         only at submission time.
Examples:
v Example 1
  This example creates a FIFO job queue based on submission time:
  SYSPRIO : 0 - (QDate)
v Example 2
  This example accounts for Class, User, and Group system priorities:
  SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
v Example 3
  This example orders the queue based on the number of jobs a user is
  currently running. The user who has the fewest jobs running is first in the
  queue. You should set
  NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with
  this SYSPRIO expression.
  SYSPRIO : 0 - UserRunningJobs
v Example 4
  This example shows one possible way to set up the SYSPRIO expression for
   fair share scheduling. For those jobs whose owner has no unused shares
   ($(UserHasShares) = 0), job priority depends only on QDate, making it a
   simple FIFO queue as in Example 1.
   For those jobs whose owner has unused shares ($(UserHasShares) = 1), job
   priority depends not only on QDate, but also on a uniform boost of
   31 536 000 (equivalent to the job being submitted one year earlier).
  These jobs still have priority differences because of submit time differences.
  It is like forming two priority tiers: the higher priority tier for jobs with
  unused shares and the lower priority tier for jobs without unused shares.
  SYSPRIO: 31536000 * $(UserHasShares) - QDate
v Example 5
  This example divides the jobs into three priority tiers:
  – Those jobs whose owner and group both have unused shares are at the
    top tier
  – Those jobs whose owner or group has unused shares are at the middle
    tier
  – Those jobs whose owner and group both have no shares remaining are at
    the bottom tier
  A user can submit two jobs to two different groups, the first job to a group
  with shares remaining and the second job to a group without any unused
  shares. If the user has unused shares, the first job will belong to the top tier
  and the second job will belong to the middle tier. If the user has no shares
   remaining, the first job will belong to the middle tier and the second job will
   belong to the bottom tier.
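   An expression consistent with these three tiers, following the same pattern
   as Example 4 (and assuming a $(GroupHasShares) variable is available
   alongside $(UserHasShares)), would be:
   SYSPRIO: 31536000 * $(UserHasShares) + 31536000 * $(GroupHasShares) - QDate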

                                               Chapter 12. Configuration file reference   309
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5

More Related Content

PDF
Ibm tivoli storage manager v6.1 server upgrade guide
PDF
Ibm tivoli storage manager for databases data protection for oracle for unix ...
PDF
Ibm tivoli storage manager for databases data protection for oracle for unix ...
PDF
Ibm tivoli storage manager for unix and linux backup archive client installat...
PDF
Ibm tivoli storage manager for linux administrator's reference 6.1
PDF
Ibm tivoli storage manager v6.1 technical guide sg247718
PDF
Getting started with ibm tivoli workload scheduler v8.3 sg247237
PDF
Ibm tivoli workload scheduler for z os best practices end-to-end and mainfram...
Ibm tivoli storage manager v6.1 server upgrade guide
Ibm tivoli storage manager for databases data protection for oracle for unix ...
Ibm tivoli storage manager for databases data protection for oracle for unix ...
Ibm tivoli storage manager for unix and linux backup archive client installat...
Ibm tivoli storage manager for linux administrator's reference 6.1
Ibm tivoli storage manager v6.1 technical guide sg247718
Getting started with ibm tivoli workload scheduler v8.3 sg247237
Ibm tivoli workload scheduler for z os best practices end-to-end and mainfram...

What's hot (17)

PDF
Ibm tivoli storage manager for aix server installation guide version 6.1
PDF
Ibm tivoli storage manager in a clustered environment sg246679
PDF
Certification guide series ibm tivoli workload scheduler v8.4 sg247628
PDF
Ibm tivoli storage resource manager a practical introduction sg246886
PDF
Ibm tivoli directory server installation and configuration guide - sc272747
PDF
Deployment guide series tivoli continuous data protection for files sg247235
PDF
Integrating ibm tivoli workload scheduler and content manager on demand to pr...
PDF
Implementing ibm tivoli workload scheduler v 8.2 extended agent for ibm tivol...
PDF
Ibm total storage san file system sg247057
PDF
Cesvip 2010 first_linux_module
PDF
Backing up lotus domino r5 using tivoli storage management sg245247
PDF
End to-end scheduling with ibm tivoli workload scheduler version 8.2 sg246624
PDF
Administering maximo asset management
PDF
A practical guide to tivoli sa nergy sg246146
PDF
Integrating ibm db2 with the ibm system storage n series sg247329
PDF
Deployment guide series ibm tivoli monitoring 6.1 sg247188
PDF
Disaster recovery solutions for ibm total storage san file system sg247157
Ibm tivoli storage manager for aix server installation guide version 6.1
Ibm tivoli storage manager in a clustered environment sg246679
Certification guide series ibm tivoli workload scheduler v8.4 sg247628
Ibm tivoli storage resource manager a practical introduction sg246886
Ibm tivoli directory server installation and configuration guide - sc272747
Deployment guide series tivoli continuous data protection for files sg247235
Integrating ibm tivoli workload scheduler and content manager on demand to pr...
Implementing ibm tivoli workload scheduler v 8.2 extended agent for ibm tivol...
Ibm total storage san file system sg247057
Cesvip 2010 first_linux_module
Backing up lotus domino r5 using tivoli storage management sg245247
End to-end scheduling with ibm tivoli workload scheduler version 8.2 sg246624
Administering maximo asset management
A practical guide to tivoli sa nergy sg246146
Integrating ibm db2 with the ibm system storage n series sg247329
Deployment guide series ibm tivoli monitoring 6.1 sg247188
Disaster recovery solutions for ibm total storage san file system sg247157
Ad

Similar to Ibm tivoli workload scheduler load leveler using and administering v3.5 (20)

PDF
Ibm tivoli storage manager for databases data protection for oracle for windo...
PDF
Ibm tivoli storage manager v6.1 server upgrade guide
PDF
Contents

Figures . . . ix
Tables . . . xi
About this information . . . xiii
Summary of changes . . . xvii

Part 1. Overview of TWS LoadLeveler concepts and operation . . . 1
Chapter 1. What is LoadLeveler? . . . 3
Chapter 2. Getting a quick start using the default configuration . . . 29
Chapter 3. What operating systems are supported by LoadLeveler? . . . 35

Part 2. Configuring and managing the TWS LoadLeveler environment . . . 39
Chapter 4. Configuring the LoadLeveler environment . . . 41
Chapter 5. Defining LoadLeveler resources to administer . . . 83
Chapter 6. Performing additional administrator tasks . . . 103
Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks . . . 169

Part 3. Submitting and managing TWS LoadLeveler jobs . . . 177
Chapter 8. Building and submitting jobs . . . 179
Chapter 9. Managing submitted jobs . . . 229
Chapter 10. Example: Using commands to build, submit, and manage jobs . . . 235
Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs . . . 237

Part 4. TWS LoadLeveler interfaces reference . . . 261
Chapter 12. Configuration file reference . . . 263
Chapter 13. Administration file reference . . . 321
Chapter 14. Job command file reference . . . 357
Chapter 15. Graphical user interface (GUI) reference . . . 403
Chapter 16. Commands . . . 411
Chapter 17. Application programming interfaces (APIs) . . . 541

Appendix A. Troubleshooting LoadLeveler . . . 699
Appendix B. Sample command output . . . 725
Appendix C. LoadLeveler port usage . . . 741
Accessibility features for TWS LoadLeveler . . . 743
Notices . . . 745
Glossary . . . 749
Index . . . 753
Figures

1. Example of a LoadLeveler cluster . . . 3
2. LoadLeveler job steps . . . 5
3. Multiple roles of machines . . . 7
4. High-level job flow . . . 16
5. Job is submitted to LoadLeveler . . . 17
6. LoadLeveler authorizes the job . . . 17
7. LoadLeveler prepares to run the job . . . 18
8. LoadLeveler starts the job . . . 18
9. LoadLeveler completes the job . . . 19
10. How control expressions affect jobs . . . 70
11. Format of a machine stanza . . . 86
12. Format of an adapter stanza . . . 88
13. Format of a class stanza . . . 93
14. Format of a user substanza . . . 95
15. Format of a user stanza . . . 98
16. Format of a group stanza . . . 99
17. Format of a cluster stanza . . . 100
18. Multicluster Example . . . 101
19. Job command file with multiple steps . . . 181
20. Job command file with multiple steps and one executable . . . 181
21. Job command file with varying input statements . . . 182
22. Using LoadLeveler variables in a job command file . . . 183
23. Job command file used as the executable . . . 185
24. Striping over multiple networks . . . 200
25. Striping over a single network . . . 202
26. POE job command file – multiple tasks per node . . . 207
27. POE sample job command file – invoking POE twice . . . 208
28. MPICH job command file - sample 1 . . . 208
29. MPICH job command file - sample 2 . . . 209
30. MPICH-GM job command file - sample 1 . . . 210
31. MPICH-GM job command file - sample 2 . . . 210
32. MVAPICH job command file - sample 1 . . . 211
33. MVAPICH job command file - sample 2 . . . 212
34. Using LOADL_PROCESSOR_LIST in a shell script . . . 213
35. Building a job command file . . . 235
36. LoadLeveler build a job window . . . 238
37. Format of administration file stanzas . . . 322
38. Format of administration file substanzas . . . 322
39. Sample administration file stanzas . . . 322
40. Sample administration file stanza with user substanzas . . . 323
41. Serial job command file . . . 358
42. Main window of the LoadLeveler GUI . . . 405
43. Creating a new pull-down menu . . . 409
44. TWS LoadLeveler Blue Gene object model . . . 562
45. TWS LoadLeveler Class object model . . . 563
46. TWS LoadLeveler Cluster object model . . . 563
47. TWS LoadLeveler Fairshare object model . . . 563
48. TWS LoadLeveler Job object model . . . 565
49. TWS LoadLeveler Machine object model . . . 566
50. TWS LoadLeveler MCluster object model . . . 566
51. TWS LoadLeveler Reservations object model . . . 566
52. TWS LoadLeveler Wlmstat object model . . . 567
53. When the primary central manager is unavailable . . . 709
54. Multiple central managers . . . 709
Tables

1. Summary of typographic conventions . . . xiv
2. Major topics in TWS LoadLeveler: Using and Administering . . . 1
3. Topics in the TWS LoadLeveler overview . . . 3
4. LoadLeveler daemons . . . 8
5. startd determines whether its own state permits a new job to run . . . 12
6. Job state descriptions and abbreviations . . . 20
7. Location and description of product directories following installation . . . 33
8. Location and description of directories for submit-only LoadLeveler . . . 33
9. Roadmap of tasks for TWS LoadLeveler administrators . . . 41
10. Roadmap of administrator tasks related to using or modifying the LoadLeveler configuration file . . . 42
11. Roadmap for defining LoadLeveler cluster characteristics . . . 44
12. Default locations for all of the files and directories . . . 47
13. Log control statements . . . 49
14. Roadmap of configuration tasks for securing LoadLeveler operations . . . 57
15. Roadmap of tasks for gathering job accounting data . . . 62
16. Collecting account data - modifying the configuration file . . . 67
17. Roadmap of administrator tasks accomplished through installation exits . . . 72
18. Roadmap of tasks for modifying the LoadLeveler administration file . . . 83
19. Types of limit keywords . . . 90
20. Enforcing job step limits . . . 91
21. Setting limits . . . 92
22. Roadmap of additional administrator tasks . . . 103
23. Roadmap of BACKFILL scheduler tasks . . . 111
24. Roadmap of tasks for using an external scheduler . . . 116
25. Effect of LoadLeveler keywords under an external scheduler . . . 116
26. Roadmap of tasks for using preemption . . . 127
27. Preemption methods for which LoadLeveler automatically resumes preempted jobs . . . 129
28. Preemption methods for which administrator or user intervention is required . . . 130
29. Roadmap of reservation tasks for administrators . . . 132
30. Roadmap of tasks for checkpointing jobs . . . 139
31. Deciding where to define the directory for staging executables . . . 141
32. Multicluster support subtasks and associated instructions . . . 149
33. Multicluster support related topics . . . 149
34. Subtasks for configuring a LoadLeveler multicluster . . . 150
35. Keywords for configuring scale-across scheduling . . . 154
36. IBM System Blue Gene Solution documentation . . . 156
37. Blue Gene subtasks and associated instructions . . . 157
38. Blue Gene related topics and associated information . . . 157
39. Blue Gene configuring subtasks and associated instructions . . . 157
40. Learning about building and submitting jobs . . . 179
41. Roadmap of user tasks for building and submitting jobs . . . 179
42. Standard files for the five job steps . . . 182
43. Checkpoint configurations . . . 191
44. Valid combinations of task assignment keywords are listed in each column . . . 196
45. node and total_tasks . . . 196
46. Blocking . . . 197
47. Unlimited blocking . . . 198
48. Roadmap of tasks for reservation owners and users . . . 213
49. Reservation states, abbreviations, and usage notes . . . 214
50. Instructions for submitting a job to run under a reservation . . . 219
51. Submitting and monitoring jobs in a LoadLeveler multicluster . . . 224
52. Roadmap of user tasks for managing submitted jobs . . . 229
53. How LoadLeveler handles job priorities . . . 231
54. User tasks available through the GUI . . . 237
55. GUI fields and input . . . 239
56. Nodes dialog box . . . 243
57. Network dialog box fields . . . 244
58. Build a job dialog box fields . . . 245
59. Limits dialog box fields . . . 247
60. Checkpointing dialog box fields . . . 248
61. Blue Gene job fields . . . 248
62. Modifying the job command file with the Edit pull-down menu . . . 249
63. Modifying the job command file with the Tools pull-down menu . . . 250
64. Saving and submitting information . . . 250
65. Sorting the jobs window . . . 252
66. Sorting the machines window . . . 257
67. Specifying which jobs appear in the Jobs window . . . 258
68. Specifying which machines appear in Machines window . . . 259
69. Configuration subtasks . . . 263
70. BG_MIN_PARTITION_SIZE values . . . 268
71. Administration file subtasks . . . 321
72. Notes on 64-bit support for administration file keywords . . . 325
73. Summary of possible values set for the env_copy keyword in the administration file . . . 335
74. Sample user and group settings for the max_reservations keyword . . . 345
75. Job command file subtasks . . . 357
76. Notes on 64-bit support for job command file keywords . . . 358
77. mcm_affinity_options default values . . . 381
78. Example of a selection table . . . 406
79. Decision table . . . 407
80. Decision table actions . . . 407
81. Window identifiers in the Xloadl file . . . 408
82. Resource variables for all the windows and the buttons . . . 408
83. Modifying help panels . . . 410
84. LoadLeveler command summary . . . 411
85. llmodify options and keywords . . . 468
86. LoadLeveler API summary . . . 541
87. BLUE_GENE specifications for ll_get_data subroutine . . . 571
88. CLASSES specifications for ll_get_data subroutine . . . 576
89. CLUSTERS specifications for ll_get_data subroutine . . . 580
90. FAIRSHARE specifications for ll_get_data subroutine . . . 582
91. JOBS specifications for ll_get_data subroutine . . . 583
92. MACHINES specifications for ll_get_data subroutine . . . 614
93. MCLUSTERS specifications for ll_get_data subroutine . . . 619
94. RESERVATIONS specifications for ll_get_data subroutine . . . 620
95. WLMSTAT specifications for ll_get_data subroutine . . . 622
96. query_daemon summary . . . 624
97. query_flags summary . . . 630
98. object_filter value related to the query flags value . . . 631
99. enum LL_reservation_data type . . . 649
100. How nodes should be arranged in the node list . . . 694
101. Why your job might not be running . . . 700
102. Why your job might not be running . . . 703
103. Troubleshooting running jobs when a machine goes down . . . 706
104. LoadLeveler default port usage . . . 741
About this information

IBM® Tivoli® Workload Scheduler (TWS) LoadLeveler® provides various ways of scheduling and managing applications for best performance and most efficient use of resources. LoadLeveler manages both serial and parallel jobs over a cluster of machines or servers, which may be desktop workstations, dedicated servers, or parallel machines. This information describes how to configure and administer this LoadLeveler cluster environment, and how to submit and manage jobs that run on machines in the cluster.

Who should use this information

This information is intended for two separate audiences:
v Personnel who are responsible for installing, configuring, and managing the LoadLeveler cluster environment. These people are called LoadLeveler administrators. LoadLeveler administrative tasks include:
– Setting up configuration and administration files
– Maintaining the LoadLeveler product
– Setting up the distributed environment for allocating batch jobs
v Users who submit and manage serial and parallel jobs to run in the LoadLeveler cluster.

Both LoadLeveler administrators and general users should be experienced with UNIX® commands. Administrators also should be familiar with:
v Cluster system management techniques, such as SMIT as it is used in the AIX® environment
v Networking and NFS or AFS® protocols

Conventions and terminology used in this information

Throughout the TWS LoadLeveler product information:
v TWS LoadLeveler for Linux® Multiplatform includes:
– IBM System servers with Advanced Micro Devices (AMD) Opteron or Intel® Extended Memory 64 Technology (EM64T) processors
– IBM System x™ servers
– IBM BladeCenter® Intel processor-based servers
– IBM Cluster 1350™
Note: IBM Tivoli Workload Scheduler LoadLeveler is supported when running Linux on non-IBM Intel-based and AMD hardware servers. Supported hardware includes:
– Servers with Intel 32-bit and Intel EM64T
– Servers with AMD 64-bit technology
v Note that in this information:
– LoadLeveler is also referred to as Tivoli Workload Scheduler LoadLeveler and TWS LoadLeveler.
– Switch_Network_Interface_For_HPS is also referred to as HPS or High Performance Switch.
Table 1 describes the typographic conventions used in this information.

Table 1. Summary of typographic conventions

Bold
   Bold words or characters represent system elements that you must use literally, such as commands, flags, and path names. Bold words also indicate the first use of a term included in the glossary.
Italic
   Italic words or characters represent variable values that you must supply. Italics are also used for book titles and for general emphasis in text.
Constant width
   Examples and information that the system displays appear in constant width typeface.
[ ]
   Brackets enclose optional items in format and syntax descriptions.
{ }
   Braces enclose a list from which you must choose an item in format and syntax descriptions.
|
   A vertical bar separates items in a list of choices. (In other words, it means “or.”)
< >
   Angle brackets (less-than and greater-than) enclose the name of a key on the keyboard. For example, <Enter> refers to the key on your terminal or workstation that is labeled with the word Enter.
...
   An ellipsis indicates that you can repeat the preceding item one or more times.
<Ctrl-x>
   The notation <Ctrl-x> indicates a control character sequence. For example, <Ctrl-c> means that you hold down the control key while pressing <c>.
\
   The continuation character (\) is used in coding examples in this information for formatting purposes.

Prerequisite and related information

The Tivoli Workload Scheduler LoadLeveler publications are:
v Installation Guide, GI10-0763
v Using and Administering, SA22-7881
v Diagnosis and Messages Guide, GA22-7882

To access all TWS LoadLeveler documentation, refer to the IBM Cluster Information Center, which contains the most recent TWS LoadLeveler documentation in PDF and HTML formats. This Web site is located at:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp

A TWS LoadLeveler Documentation Updates file also is maintained on this Web site. The TWS LoadLeveler Documentation Updates file contains updates to the TWS LoadLeveler documentation. These updates include documentation corrections and clarifications that were discovered after the TWS LoadLeveler books were published.

Both the current TWS LoadLeveler books and earlier versions of the library are also available in PDF format from the IBM Publications Center Web site located at:
http://www.elink.ibmlink.ibm.com/publications/servlet/pbi.wss

To easily locate a book in the IBM Publications Center, supply the book’s publication number. The publication number for each of the TWS LoadLeveler books is listed after the book title in the preceding list.
How to send your comments

Your feedback is important in helping us to produce accurate, high-quality information. If you have any comments about this book or any other TWS LoadLeveler documentation:
v Send your comments by e-mail to: mhvrcfs@us.ibm.com
Include the book title and order number, and, if applicable, the specific location of the information you have comments on (for example, a page number or a table number).
v Fill out one of the forms at the back of this book and return it by mail, by fax, or by giving it to an IBM representative.

To contact the IBM cluster development organization, send your comments by e-mail to: cluster@us.ibm.com
Summary of changes

The following sections summarize changes to the IBM Tivoli Workload Scheduler (TWS) LoadLeveler product and TWS LoadLeveler library for each new release or major service update for a given product version. Within each information unit in the library, a vertical line to the left of text and illustrations indicates technical changes or additions made to the previous edition of the information.

Changes to TWS LoadLeveler for this release or update include:

v New information:
– Recurring reservation support:
- The TWS LoadLeveler commands and APIs have been enhanced to support recurring reservations.
- Accounting records have been enhanced to include recurring reservation entries.
- The new recurring job command file keyword allows a user to specify that the job can run in every occurrence of the recurring reservation to which it is bound.
– Data staging support:
- Jobs can request that data files be staged from a remote storage location before the job executes and staged back to remote storage after it finishes execution.
- Data staging can be scheduled at submit time or just in time for the application execution.
– Multicluster scale-across scheduling support:
- Allows a large job to span resources across more than one cluster. Scale-across scheduling is a way to schedule jobs in the multicluster environment to span resources across more than one cluster. This feature allows large jobs that request more resources than any single cluster can provide to combine the resources from more than one cluster and run on the combined resources, effectively spanning resources across more than one cluster.
- Allows utilization of fragmented resources from more than one cluster. Fragmented resources occur when the resources available on a single cluster cannot satisfy any single job on that cluster. This feature allows any size job to take advantage of these resources by combining them from multiple clusters.
– Enhanced WLM support:
- Integrates TWS LoadLeveler with AIX Workload Manager (WLM) virtual memory and large page resource limit support.
- Enforces the virtual memory and large page limit usage of a job.
- Reports statistics for virtual memory and large page limit usage.
- Dynamically changes the virtual memory and large page limit usage of a job.
– Enhanced adapter striping (sn_all) support:
- Submits jobs to nodes that have one or more networks in the failed (NOTREADY) state, provided that all of the nodes assigned to the job have more than half of their networks in the READY state.
- A new striping_with_minimum_networks configuration keyword has been added to the class stanza to support striping with failed networks.
– Enhanced affinity support:
- Task affinity support has been enhanced on nodes that are booted in single threaded (ST) mode and on nodes that do not support simultaneous multithreading (SMT).
– NetworkID64 for Mellanox adapters on Linux systems with InfiniBand support:
- Generates unique NetworkID64 IDs for adapter ports that are connected to the same switch and have the same IP subnet address. This ensures that ports that are connected to the same switch, but are configured with different IP subnet addresses, will get different NetworkID64 values.

v Changed information:
– This is the last release that will provide the following functions:
- The Motif-based graphical user interface xloadl. The function available in xloadl has been frozen since TWS LoadLeveler 3.3.2 and there are no plans to update this GUI with any new function added to TWS LoadLeveler after that level.
- The IBM BladeCenter JS21 with a BladeCenter H chassis interconnected with the InfiniBand Host Channel Adapters connected to a Cisco InfiniBand SDR switch.
- The IBM Power System 575 (Model 9118-575) and IBM Power System 550 (Model 9133-55A) interconnected with the InfiniBand Host Channel Adapter and Cisco switch.
- The High Performance Switch.
– If you have a mixed TWS LoadLeveler cluster and need to run your job on a specific operating system or architecture, you must define the requirements keyword statement in your job command file, specifying the desired Arch or OpSys. For example:
   Requirements: (Arch == "RS6000") && (OpSys == "AIX53")

v Deleted information: The following functions are no longer supported and the information has been removed:
– The scheduling of parallel jobs with the default scheduler (SCHEDULER_TYPE=LL_DEFAULT)
– The min_processors and max_processors keywords
– The RSET_CONSUMABLE_CPUS option for the rset_support configuration keyword and the rset job command file keyword
– The API functions ll_get_nodes, ll_free_nodes, ll_get_jobs, ll_free_jobs, and ll_start_job
– Red Hat Enterprise Linux 3
– The llctl purgeschedd function, which has been replaced by the llmovespool function
– The lldbconvert function, which is no longer needed for migration; the lldbconvert command is not included in TWS LoadLeveler 3.5
Part 1. Overview of TWS LoadLeveler concepts and operation

Setting up IBM Tivoli Workload Scheduler (TWS) LoadLeveler involves defining machines, users, jobs, and how they interact, in such a way that TWS LoadLeveler is able to run jobs quickly and efficiently.

Once you have a basic understanding of the TWS LoadLeveler product and its interfaces, you can find more details in the topics listed in Table 2.

Table 2. Major topics in TWS LoadLeveler: Using and Administering
v Performing administrator tasks: see Part 2, “Configuring and managing the TWS LoadLeveler environment,” on page 39
v Performing general user tasks: see Part 3, “Submitting and managing TWS LoadLeveler jobs,” on page 177
v Using TWS LoadLeveler interfaces: see Part 4, “TWS LoadLeveler interfaces reference,” on page 261
Chapter 1. What is LoadLeveler?

LoadLeveler is a job management system that allows users to run more jobs in less time by matching the jobs’ processing needs with the available resources. LoadLeveler schedules jobs, and provides functions for building, submitting, and processing jobs quickly and efficiently in a dynamic environment.

Figure 1 shows the different environments to which LoadLeveler can schedule jobs. Together, these environments comprise the LoadLeveler cluster.

Figure 1. Example of a LoadLeveler cluster. The original figure shows a cluster made up of IBM Power Systems machines running AIX, an IBM eServer Cluster 1350 and IBM BladeCenter machines running Linux, and submit-only workstations.

As Figure 1 also illustrates, a LoadLeveler cluster can include submit-only machines, which allow users to have access to a limited number of LoadLeveler features.

Throughout all the topics, the terms workstation, machine, node, and operating system instance (OSI) refer to the machines in your cluster. In LoadLeveler, an OSI is treated as a single instance of an operating system image.

If you are unfamiliar with the TWS LoadLeveler product, consider reading one or more of the introductory topics listed in Table 3.

Table 3. Topics in the TWS LoadLeveler overview
v Using the default configuration for getting a quick start: see Chapter 2, “Getting a quick start using the default configuration,” on page 29
v Specific products and features that are required for or available through the TWS LoadLeveler environment: see Chapter 3, “What operating systems are supported by LoadLeveler?,” on page 35
LoadLeveler basics

LoadLeveler has various types of interfaces that enable users to create and submit jobs and allow system administrators to configure the system and control running jobs. These interfaces include:
v Control files that define the elements, characteristics, and policies of LoadLeveler and the jobs it manages. These files are the configuration file, the administration file, and the job command file.
v The command line interface, which gives you access to basic job and administrative functions.
v A graphical user interface (GUI), which provides system access similar to the command line interface. Experienced users and administrators may find the command line interface more efficient than the GUI for job and administrative functions.
v An application programming interface (API), which allows application programs written by users and administrators to interact with the LoadLeveler environment.

The commands, GUI, and APIs permit different levels of access to administrators and users. User access is typically restricted to submitting and managing individual jobs, while administrative access allows setting up system configurations, job scheduling, and accounting.

Using either the command line or the GUI, users create job command files that instruct the system on how to process information. Each job command file consists of keywords followed by the user-defined association for that keyword. For example, the keyword executable tells LoadLeveler that you are about to define the name of a program you want to run. Therefore, executable = longjob tells LoadLeveler to run the program called longjob.

After creating the job command file, you invoke LoadLeveler commands to monitor and control the job as it moves through the system. LoadLeveler monitors each job as it moves through the system using process control daemons. However, the administrator maintains ultimate control over all LoadLeveler jobs by defining job classes that control how and when LoadLeveler will run a job.

In addition to setting up job classes, the administrator can also control how jobs move through the system by specifying the type of scheduler. LoadLeveler has several different scheduler options that start jobs using specific algorithms to balance job priority with available machine resources.

When LoadLeveler administrators are configuring clusters and when users are planning jobs, they need to be aware of the machine resources available in the cluster. These resources include items like the number of CPUs and the amount of memory available for each job. Because resource availability will vary over time, LoadLeveler defines them as consumable resources.
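As an illustration of the keyword statements described earlier in this topic, a complete job command file for the longjob program can be as small as the following sketch. The output and error file names here are illustrative choices, not taken from this book; $(jobid) is a job command file variable that LoadLeveler replaces with the job identifier, and the queue statement marks the end of the step definition:

   # @ executable = longjob
   # @ output     = longjob.$(jobid).out
   # @ error      = longjob.$(jobid).err
   # @ queue

You would then submit this file with the llsubmit command, which is described in Chapter 16, “Commands.”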
LoadLeveler: A network job management and scheduling system

A network job management and job scheduling system, such as LoadLeveler, is a software program that schedules and manages jobs that you submit to one or more machines under its control. LoadLeveler accepts jobs that users submit and reviews the job requirements. LoadLeveler then examines the machines under its control to determine which machines are best suited to run each job.

Job definition

LoadLeveler schedules your jobs on one or more machines for processing. The definition of a job, in this context, is a set of job steps. For each job step, you can specify a different executable (the executable is the part of the job that gets processed). You can use LoadLeveler to submit jobs which are made up of one or more job steps, where each job step depends upon the completion status of a previous job step. For example, Figure 2 illustrates a stream of job steps.

Figure 2. LoadLeveler job steps. The original figure shows a job command file with three steps: job step 1 copies data from tape and checks its exit status; on success, job step 2 processes the data and checks its exit status; on success, job step 3 formats and prints the results. A failing exit status at either checkpoint ends the program.

Each of these job steps is defined in a single job command file. A job command file specifies the name of the job, as well as the job steps that you want to submit, and can contain other LoadLeveler statements.

LoadLeveler tries to execute each of your job steps on a machine that has enough resources to support executing and checkpointing each step. If your job command file has multiple job steps, the job steps will not necessarily run on the same machine, unless you explicitly request that they do.

You can submit batch jobs to LoadLeveler for scheduling. Batch jobs run in the background and generally do not require any input from the user. Batch jobs can either be serial or parallel. A serial job runs on a single machine. A parallel job is a program designed to execute as a number of individual, but related, processes on one or more of your system’s nodes. When executed, these related processes can communicate with each other (through message passing or shared memory) to exchange data or synchronize their execution. For parallel jobs, LoadLeveler interacts with Parallel Operating Environment (POE) to allocate nodes, assign tasks to nodes, and launch tasks.
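The job stream in Figure 2 could be written as a single job command file with three steps, using the step_name and dependency keywords so that each step runs only if the previous step exits with status 0. The step and program names in this sketch are illustrative, not taken from this book:

   # @ step_name  = copy_data
   # @ executable = copy_from_tape
   # @ queue
   # @ step_name  = process_data
   # @ dependency = (copy_data == 0)
   # @ executable = process
   # @ queue
   # @ step_name  = format_results
   # @ dependency = (process_data == 0)
   # @ executable = format_and_print
   # @ queue

If copy_data ends with a nonzero exit status, the dependency for process_data is not satisfied and the remaining steps do not run, which corresponds to the “End program” branches in the figure.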
Machine definition

For LoadLeveler to schedule a job on a machine, the machine must be a valid member of the LoadLeveler cluster. A cluster is the combination of all of the different types of machines that use LoadLeveler.

To make a machine a member of the LoadLeveler cluster, the administrator has to install the LoadLeveler software onto the machine and identify the central manager (described in "Roles of machines"). Once a machine becomes a valid member of the cluster, LoadLeveler can schedule jobs to it.

Roles of machines

Each machine in the LoadLeveler cluster performs one or more of the following roles in scheduling jobs:
v Scheduling Machine: When a job is submitted, it gets placed in a queue managed by a scheduling machine. This machine contacts another machine that serves as the central manager for the entire LoadLeveler cluster. The scheduling machine asks the central manager to find a machine that can run the job, and also keeps persistent information about the job. Some scheduling machines are known as public scheduling machines, meaning that any LoadLeveler user can access them. These machines schedule jobs submitted from submit-only machines.
v Central Manager Machine: The role of the central manager is to examine the job's requirements and find one or more machines in the LoadLeveler cluster that will run the job. Once it finds the machine(s), it notifies the scheduling machine.
v Executing Machine: The machine that runs the job is known as the executing machine.
v Submitting Machine: This type of machine is known as a submit-only machine. It participates in the LoadLeveler cluster on a limited basis. Although the name implies that users of these machines can only submit jobs, they can also query and cancel jobs. Users of these machines also have their own graphical user interface (GUI) that provides them with the submit-only subset of functions. The submit-only machine feature allows workstations that are not part of the LoadLeveler cluster to submit jobs to the cluster.

Keep in mind that one machine can assume multiple roles, as shown in Figure 3 on page 7.
Figure 3. Multiple roles of machines

Machine availability

There may be times when some of the machines in the LoadLeveler cluster are not available to process jobs; for instance, when the owners of the machines have decided to make them unavailable. This ability of LoadLeveler to allow users to restrict the use of their machines provides flexibility and control over the resources.

Machine owners can make their personal workstations available to other LoadLeveler users in several ways. For example, you can specify that:
v The machine will always be available
v The machine will be available only between certain hours
v The machine will be available when the keyboard and mouse are not being used interactively.
Owners can also specify that their personal workstations never be made available to other LoadLeveler users. A sketch of how such a policy can be expressed follows.
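Availability policies like these are expressed with control expressions in the configuration file. The following is a minimal sketch, assuming illustrative idle-time thresholds; an actual installation's expressions depend on local policy (see "Managing job status through control expressions" on page 68):

   # Local configuration sketch for a personal workstation (illustrative values):
   # run jobs only after 15 minutes of keyboard/mouse idle time,
   # and suspend them as soon as the owner returns.
   MINUTE   = 60
   START    : (KeyboardIdle > (15 * $(MINUTE)))
   SUSPEND  : (KeyboardIdle < $(MINUTE))
   CONTINUE : (KeyboardIdle > (15 * $(MINUTE)))

Here MINUTE is a user-defined macro, and KeyboardIdle is one of the variables the startd daemon reports (see "How LoadLeveler daemons process jobs" later in this chapter).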
How LoadLeveler schedules jobs

When a user submits a job, LoadLeveler examines the job command file to determine what resources the job will need. LoadLeveler determines which machine, or group of machines, is best suited to provide these resources, then LoadLeveler dispatches the job to the appropriate machines. To aid this process, LoadLeveler uses queues. A job queue is a list of jobs that are waiting to be processed. When a user submits a job to LoadLeveler, the job is entered into an internal database, which resides on one of the machines in the LoadLeveler cluster, until it is ready to be dispatched to run on another machine.

Once LoadLeveler examines a job to determine its required resources, the job is dispatched to a machine to be processed. A job can be dispatched to either one machine, or in the case of parallel jobs, to multiple machines. Once the job reaches the executing machine, the job runs.

Jobs do not necessarily get dispatched to machines in the cluster on a first-come, first-served basis. Instead, LoadLeveler examines the requirements and characteristics of the job and the availability of machines, and then determines the best time for the job to be dispatched.

LoadLeveler also uses job classes to schedule jobs to run on machines. A job class is a classification to which a job can belong. For example, short-running jobs may belong to a job class called short_jobs. Similarly, jobs that are only allowed to run on the weekends may belong to a class called weekend. The system administrator can define these job classes and select the users that are authorized to submit jobs of these classes. You can specify which types of jobs will run on a machine by specifying the types of job classes the machine will support.

LoadLeveler also examines a job's priority to determine when to schedule the job on a machine. The priority of a job is used to determine its position among a list of all jobs waiting to be dispatched. "The LoadLeveler job cycle" on page 16 describes job flow in the LoadLeveler environment in more detail.

How LoadLeveler daemons process jobs

LoadLeveler daemons are programs that run continuously and control the processes that move jobs through the LoadLeveler cluster. A master daemon (LoadL_master) runs on all machines in the LoadLeveler cluster and manages the other daemons. Table 4 summarizes these daemons, which are described in further detail in the topics immediately following the table.

Table 4. LoadLeveler daemons

LoadL_master
   Referred to as the master daemon. Runs on all machines in the LoadLeveler cluster and manages other daemons.
LoadL_schedd
   Referred to as the Schedd daemon. Receives jobs from the llsubmit command and manages them on machines selected by the negotiator daemon (as defined by the administrator).
LoadL_startd
   Referred to as the startd daemon. Monitors job and machine resources on local machines and forwards information to the negotiator daemon. The startd daemon spawns the starter process (LoadL_starter), which manages running jobs on the executing machine.
Table 4. LoadLeveler daemons (continued)

LoadL_negotiator
   Referred to as the negotiator daemon. Monitors the status of each job and machine in the cluster. Responds to queries from the llstatus and llq commands. Runs on the central manager machine.
LoadL_kbdd
   Referred to as the keyboard daemon. Monitors keyboard and mouse activity.
LoadL_GSmonitor
   Referred to as the gsmonitor daemon. Monitors machine availability and notifies the negotiator when a machine is no longer reachable, more quickly than the negotiator's own MACHINE_UPDATE_INTERVAL heartbeat checking allows.

The master daemon

The master daemon runs on every machine in the LoadLeveler cluster, except the submit-only machines. The real and effective user ID of this daemon must be root. The LoadL_master binary is installed as a setuid program with the owner set to root. The master daemon and all daemons started by the master must be able to run with root privileges in order to switch the identity to the owner of any job being processed.

The master daemon determines whether to start any other daemons by checking the START_DAEMONS keyword in the global or local configuration file. If the keyword is set to true, the daemons are started. If the keyword is set to false, the master daemon terminates and generates a message. The master daemon will not start on a Linux machine if SEC_ENABLEMENT is set to CTSEC. If the master daemon does not start, no other daemons will start.

On the machine designated as the central manager, the master runs the negotiator daemon. The master also controls the central manager backup function. The negotiator runs on either the primary or an alternate central manager. If a central manager failure is detected, one of the alternate central managers becomes the primary central manager by starting the negotiator.

The master daemon starts and, if necessary, restarts all of the LoadLeveler daemons that the machine it resides on is configured to run. As part of its startup procedure, this daemon executes the .llrc file (a dummy file is provided in the bin subdirectory of the release directory). You can use this script to customize your local configuration file, specifying what particular data is stored locally. This daemon also runs the kbdd daemon, which monitors keyboard and mouse activity.

When the master daemon detects a failure on one of the daemons that it is monitoring, it attempts to restart it. Because this daemon recognizes that certain situations may prevent a daemon from running, it limits its restart attempts to the number defined for the RESTARTS_PER_HOUR keyword in the configuration file. If this limit is exceeded, the master daemon forces all daemons, including itself, to exit.

When a daemon must be restarted, the master sends mail to the administrators identified by the LOADL_ADMIN keyword in the configuration file. The mail contains the name of the failing daemon, its termination status, and a section of the daemon's most recent log file. If the master aborts after exceeding RESTARTS_PER_HOUR, it will also send that mail before exiting.
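The keywords just described are set in the configuration file. The following is a minimal sketch; the values shown are illustrative, not recommendations:

   # Global configuration file sketch (illustrative values):
   LOADL_ADMIN       = loadl brenda   # administrators who receive failure mail
   START_DAEMONS     = TRUE           # master starts the other configured daemons
   RESTARTS_PER_HOUR = 12             # stop retrying after 12 restarts in an hour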
The master daemon may perform the following actions in response to an llctl command:
v Kill all daemons and exit (stop keyword)
v Kill all daemons and execute a new master (recycle keyword)
v Rerun the .llrc file, reread the configuration files, and stop or start daemons as appropriate for the new configuration files (reconfig keyword)
v Send a drain request to startd and Schedd, and send the result to the caller (drain keyword)
v Send a flush request to startd and send the result to the caller (flush keyword)
v Send a suspend request to startd and send the result to the caller (suspend keyword)
v Send a resume request to startd and Schedd, and send the result to the caller (resume keyword)

The Schedd daemon

The Schedd daemon receives jobs sent by the llsubmit command and manages those jobs on machines selected by the negotiator daemon. The Schedd daemon is started, restarted, signalled, and stopped by the master daemon.

The Schedd daemon can be in any one of the following activity states:

Available
   This machine is available to schedule jobs.
Drained
   The Schedd machine accepts no more jobs. There are no jobs in the Starting or Running state. Jobs in the Idle state are drained, meaning they will not get dispatched.
Draining
   The Schedd daemon is being drained by the administrator, but some jobs are still running. The state of the machine remains Draining until all running jobs complete. At that time, the machine status changes to Drained.
Down
   The daemon is not running on this machine. The Schedd daemon enters this state when it has not reported its status to the negotiator. This can occur when the machine is actually down, or because there is a network failure.

The Schedd daemon performs the following functions:
v Assigns new job identifiers when requested by the job submission process (for example, by the llsubmit command).
v Receives new jobs from the llsubmit command. A new job is received as a job object for each job step. A job object is the data structure in memory containing all the information about a job step. The Schedd forwards the job object to the negotiator daemon as soon as it is received from the submit command.
v Maintains on-disk copies of jobs submitted locally (on this machine) that are either waiting or running on a remote (different) machine. The central manager can use this information to reconstruct the job information in the event of a failure. This information is also used for accounting purposes.
v Responds to directives sent by the administrator through the negotiator daemon. The directives include:
   – Run a job.
   – Change the priority of a job.
   – Remove a job.
   – Hold or release a job.
   – Send information about all jobs.
v Sends job events to the negotiator daemon when:
   – Schedd is restarting.
   – A new series of job objects is arriving.
   – A job is started.
   – A job was rejected, completed, removed, or vacated. Schedd determines the status by examining the exit status returned by the startd.
v Communicates with the Parallel Operating Environment (POE) when you run an interactive POE job.
v Requests that a remote startd daemon end a job.
v Receives accounting information from startd.
v Receives requests for reservations.
v Collects resource usage data when jobs terminate and stores it as historic fair share data in the $(SPOOL) directory.
v Sends historic fair share data to the central manager when it is updated or when the Schedd daemon is restarted.
v Maintains and stores records of historic CPU and IBM System Blue Gene® Solution utilization for users and groups known to the Schedd.
v Passes the historic CPU and Blue Gene utilization data to the central manager.

The startd daemon

The startd daemon monitors the status of each job, reservation, and machine in the cluster, and forwards this information to the negotiator daemon. The startd also receives and executes job requests originating from remote machines. The master daemon starts, restarts, signals, and stops the startd daemon.

Checkpoint/restart is not supported in LoadLeveler for Linux. If a checkpointed job is sent to a Linux node, the Linux node will reject the job.

The startd daemon can be in any one of the following states:

Busy
   The maximum number of jobs are running on this machine, as specified by the MAX_STARTERS configuration keyword.
Down
   The daemon is not running on this machine. The startd daemon enters this state when it has not reported its status to the negotiator. This can occur when the machine is actually down, or because there is a network failure.
Drained
   The startd machine will not accept any new jobs. No jobs are running when startd is in the Drained state.
Draining
   The startd daemon is being drained by the administrator, but some jobs are still running. The machine remains in the Draining state until all of the running jobs have completed, at which time the machine status changes to Drained. The startd daemon will not accept any new jobs while in the Draining state.
Flush
   Any running jobs have been vacated (terminated and returned to the queue to be redispatched). The startd daemon will not accept any new jobs.
Idle
   The machine is not running any jobs.
None
   LoadLeveler is running on this machine, but no jobs can run here.
Running
   The machine is running one or more jobs and is capable of running more.
Suspend
   All LoadLeveler jobs running on this machine are stopped (cease processing), but remain in virtual memory. The startd daemon will not accept any new jobs.

The startd daemon performs these functions:
v Runs a time-out procedure that includes building a snapshot of the state of the machine, including static and dynamic data. This time-out procedure is run at the following times:
   – After a job completes.
   – According to the definition of the POLLING_FREQUENCY keyword in the configuration file.
v Records the following information in LoadLeveler variables and sends the information to the negotiator:
   – State (of the startd daemon)
   – EnteredCurrentState
   – Memory
   – Disk
   – KeyboardIdle
   – Cpus
   – LoadAvg
   – Machine
   – Adapter
   – AvailableClasses
v Calculates the SUSPEND, RESUME, CONTINUE, and VACATE expressions through which you can manage job status.
v Receives job requests from the Schedd daemon to:
   – Start a job
   – Preempt or resume a job
   – Vacate a job
   – Cancel a job
   When the Schedd daemon tells the startd daemon to start a job, the startd determines whether its own state permits a new job to run, as shown in Table 5.

   Table 5. startd determines whether its own state permits a new job to run

   If it can start a new job:
      The startd forks a starter process.
   If it cannot start a new job:
      The startd rejects the request for one of the following reasons:
      v Jobs have been suspended, flushed, or drained
      v The job limit set for the MAX_STARTERS keyword has been reached
      v There are not enough classes available for the designated job class
v Receives requests from the master (through the llctl command) to do one of the following:
   – Drain (drain keyword)
   – Flush (flush keyword)
   – Suspend (suspend keyword)
   – Resume (resume keyword)
v For each request, startd marks its own new state, forwards its new state to the negotiator daemon, and then performs the appropriate action for any jobs that are active.
v Receives notification of keyboard and mouse activity from the kbdd daemon.
v Periodically examines the process table for LoadLeveler jobs and accumulates resources consumed by those jobs. This resource data is used to determine if a job has exceeded its job limit and for recording in the history file.
v Sends accounting information to Schedd.

The starter process

The startd daemon spawns a starter process after the Schedd daemon tells the startd daemon to start a job. The starter process manages all the processes associated with a job step. The starter process is responsible for running the job and reporting status back to the startd daemon.

The starter process performs these functions:
v Processes the prolog and epilog programs as defined by the JOB_PROLOG and JOB_EPILOG keywords in the configuration file. The job will not run if the prolog program exits with a return code other than zero.
v Handles authentication. This includes:
   – Authenticates AFS, if necessary.
   – Verifies that the submitting user is not root.
   – Verifies that the submitting user has access to the appropriate directories in the local file system.
v Runs the job by forking a child process that runs with the user ID and all groups of the submitting user. That child process creates a new process group of which it is the process group leader, and executes the user's program or a shell. The starter process is responsible for detecting the termination of any process that it forks. To ensure that all processes associated with a job are terminated after the process forked by the starter terminates, process tracking must be enabled. To configure LoadLeveler for process tracking, see "Tracking job processes" on page 70.
v Responds to vacate and suspend orders from the startd.

The negotiator daemon

The negotiator daemon maintains the status of each job and machine in the cluster and responds to queries from the llstatus and llq commands. The negotiator daemon runs on a single machine in the cluster (the central manager machine). This daemon is started, restarted, signalled, and stopped by the master daemon.

In a mixed cluster, the negotiator daemon must run on an AIX node.

The negotiator daemon receives status messages from each Schedd and startd daemon running in the cluster. The negotiator daemon tracks:
v Which Schedd daemons are running
v Which startd daemons are running, and the status of each startd machine.
If the negotiator does not receive an update from any machine within the time period defined by the MACHINE_UPDATE_INTERVAL keyword, then the negotiator assumes that the machine is down, and therefore the Schedd and startd daemons are also down. The negotiator also maintains in its memory several queues and tables which determine where the job should run.

The negotiator performs the following functions:
v Receives and records job status changes from the Schedd daemon.
v Schedules jobs based on a variety of scheduling criteria and policy options. Once a job is selected, the negotiator contacts the Schedd that originally created the job.
v Handles requests to:
   – Set priorities
   – Query about jobs, machines, classes, and reservations
   – Change reservation attributes
   – Bind jobs to reservations
   – Remove a reservation
   – Remove a job
   – Hold or release a job
   – Favor or unfavor a user or a job.
v Receives notification of Schedd resets indicating that a Schedd has restarted.

The kbdd daemon

The kbdd daemon monitors keyboard and mouse activity. The kbdd daemon is spawned by the master daemon if the X_RUNS_HERE keyword in the configuration file is set to true.

The kbdd daemon notifies the startd daemon when it detects keyboard or mouse activity; however, kbdd is not interrupt driven. It sleeps for the number of seconds defined by the POLLING_FREQUENCY keyword in the LoadLeveler configuration file, and then determines if X events, in the form of mouse or keyboard activity, have occurred. For more information on the configuration file, see Chapter 5, "Defining LoadLeveler resources to administer," on page 83.

The gsmonitor daemon

The gsmonitor daemon is not available in LoadLeveler for Linux.

The negotiator daemon monitors for down machines based on the heartbeat responses within the MACHINE_UPDATE_INTERVAL time period. If the negotiator has not received an update after two MACHINE_UPDATE_INTERVAL periods, then it marks the machine as down, and notifies the Schedd to remove any jobs running on that machine. The gsmonitor daemon (LoadL_GSmonitor) allows this cleanup to occur more reliably. The gsmonitor daemon uses the Group Services Application Programming Interface (GSAPI) to monitor machine availability on peer domains and to notify the negotiator quickly when a machine is no longer reachable.

If the GSMONITOR_DOMAIN keyword was not specified in the LoadLeveler configuration file, then LoadLeveler will try to determine if the machine is running in a peer (cluster) domain. The gsmonitor must run in a peer domain.
The gsmonitor detects that it is running in an active peer domain and then uses the RMC API to determine the node numbers and names of machines running in the cluster.

If the administrator sets up a LoadLeveler administration file that contains OSIs spanning several peer domains, then a gsmonitor daemon must be started in each domain. A gsmonitor daemon can monitor only the OSIs contained in the domain within which it is running. The administrator specifies which OSIs run the gsmonitor daemon by specifying GSMONITOR_RUNS_HERE=TRUE in the local configuration file for that OSI. The default for GSMONITOR_RUNS_HERE is False.

The gsmonitor daemon should be run on one or two nodes in the peer domain. By running LoadL_GSmonitor on more than one node in a domain, you will have a backup in case one of the nodes that the monitor is running on goes down.

LoadL_GSmonitor subscribes to the Group Services system-defined host membership group, which is represented by the HA_GS_HOST_MEMBERSHIP Group Services keyword. This group monitors every configured node in the system partition and every node in the active peer domain.

Note:
1. The Group Services routines need to be run as root, so the LoadL_GSmonitor executable must be owned by root and have the setuid permission bit enabled.
2. Running more than one LoadL_GSmonitor daemon per peer domain will not cause a problem; it will just cause the negotiator to be notified by each running daemon.
3. For more information about the Group Services subsystem, see the RSCT Administration Guide, SA22-7889, for peer domains.
4. For more information about GSAPI, see Group Services Programming Guide and Reference, SA22-7355.
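Daemon-placement keywords such as X_RUNS_HERE and GSMONITOR_RUNS_HERE are typically set per machine in the local configuration file. A minimal sketch, with illustrative values only:

   # Local configuration file sketch for one machine (illustrative values):
   X_RUNS_HERE         = TRUE   # this machine runs X, so start the kbdd daemon
   POLLING_FREQUENCY   = 5      # seconds kbdd sleeps between activity checks
   GSMONITOR_RUNS_HERE = TRUE   # this OSI runs the gsmonitor daemon (AIX peer domains only)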
The LoadLeveler job cycle

To illustrate the flow of job information through the LoadLeveler cluster, a description and a sequence of diagrams are provided.

Figure 4. High-level job flow

The managing machine in a LoadLeveler cluster is known as the central manager. There are also machines that act as schedulers, and machines that serve as the executing machines. The arrows in Figure 4 illustrate the following:
v Arrow 1 indicates that a job has been submitted to LoadLeveler.
v Arrow 2 indicates that the scheduling machine contacts the central manager to inform it that a job has been submitted, and to find out if a machine exists that matches the job requirements.
v Arrow 3 indicates that the central manager checks to determine if a machine exists that is capable of running the job. Once a machine is found, the central manager informs the scheduling machine which machine is available.
v Arrow 4 indicates that the scheduling machine contacts the executing machine and provides it with information regarding the job. In this case, the scheduling and executing machines are different machines in the cluster, but they do not have to be different; the scheduling and executing machines may be the same physical machine.

Figure 4 is broken down into the following more detailed diagrams illustrating how LoadLeveler processes a job. The diagrams indicate specific job states for this example, but do not list all of the possible states for LoadLeveler jobs. A complete list of job states appears in "LoadLeveler job states" on page 19.

1. Submit a LoadLeveler job:
Figure 5. Job is submitted to LoadLeveler

Figure 5 illustrates that the Schedd daemon runs on the scheduling machine. This machine can also have the startd daemon running on it. The negotiator daemon resides on the central manager machine. The arrows in Figure 5 illustrate the following:
v Arrow 1 indicates that a job has been submitted to the scheduling machine.
v Arrow 2 indicates that the Schedd daemon, on the scheduling machine, stores all of the relevant job information on local disk.
v Arrow 3 indicates that the Schedd daemon sends job description information to the negotiator daemon. At this point, the submitted job is in the Idle state.

2. Permit to run:

Figure 6. LoadLeveler authorizes the job
In Figure 6 on page 17, arrow 4 indicates that the negotiator daemon authorizes the Schedd daemon to begin taking steps to run the job. This authorization is called a permit to run. Once this is done, the job is considered Pending or Starting.

3. Prepare to run:

Figure 7. LoadLeveler prepares to run the job

In Figure 7, arrow 5 illustrates that the Schedd daemon contacts the startd daemon on the executing machine and requests that it start the job. The executing machine can either be a local machine (the machine to which the job was submitted) or another machine in the cluster. In this example, the local machine is not the executing machine.

4. Initiate job:

Figure 8. LoadLeveler starts the job
The arrows in Figure 8 on page 18 illustrate the following:
v Arrow 6 indicates that the startd daemon on the executing machine spawns a starter process for the job.
v Arrow 7 indicates that the Schedd daemon sends the starter process the job information and the executable.
v Arrow 8 indicates that the Schedd daemon notifies the negotiator daemon that the job has been started and the negotiator daemon marks the job as Running.
The starter forks and executes the user's job, and the starter parent waits for the child to complete.

5. Complete job:

Figure 9. LoadLeveler completes the job

The arrows in Figure 9 illustrate the following:
v Arrow 9 indicates that when the job completes, the starter process notifies the startd daemon.
v Arrow 10 indicates that the startd daemon notifies the Schedd daemon.
v Arrow 11 indicates that the Schedd daemon examines the information it has received, and forwards it to the negotiator daemon. At this point, the job is in Completed or Complete Pending state.

LoadLeveler job states

As LoadLeveler processes a job, the job moves through various states. These states are listed in Table 6 on page 20. Job states that include "Pending," such as Complete Pending and Vacate Pending, are intermediate, temporary states.

Some options on LoadLeveler interfaces are valid only for jobs in certain states. For example, the llmodify command has options that apply only to jobs that are in the Idle state, or in states that are similar to it. To determine which job states are similar to the Idle state, use the "Similar to" information in Table 6 on page 20, which indicates whether a particular job state is similar to the Idle, Running, or Terminating state.
  • 40. indicates whether a particular job state is similar to the Idle, Running, or Terminating state. A dash (—) indicates that the state is not similar to an Idle, Running, or Terminating state. Table 6. Job state descriptions and abbreviations Job state Similar to Abbreviation Description Idle or in displays / Running output state? Canceled Terminating CA The job was canceled either by a user or by an administrator. Checkpointing Running CK Indicates that a checkpoint has been initiated. Completed Terminating C The job has completed. Complete Terminating CP The job is in the process of being Pending completed. Deferred Idle D The job will not be assigned to a machine until a specified date. This date may have been specified by the user in the job command file, or may have been generated by the negotiator because a parallel job did not accumulate enough machines to run the job. Only the negotiator places a job in the Deferred state. Idle Idle I The job is being considered to run on a machine, though no machine has been selected. Not Queued Idle NQ The job is not being considered to run on a machine. A job can enter this state because the associated Schedd is down, the user or group associated with the job is at its maximum maxqueued or maxidle value, or because the job has a dependency which cannot be determined. For more information on these keywords, see “Controlling the mix of idle and running jobs” on page 721. (Only the negotiator places a job in the NotQueued state.) Not Run — NR The job will never be run because a dependency associated with the job was found to be false. Pending Running P The job is in the process of starting on one or more machines. (The negotiator indicates this state until the Schedd acknowledges that it has received the request to start the job. Then the negotiator changes the state of the job to Starting. The Schedd indicates the Pending state until all startd machines have acknowledged receipt of the start request. The Schedd then changes the state of the job to Starting.) 20 TWS LoadLeveler: Using and Administering
Table 6. Job state descriptions and abbreviations (continued)

Preempted (E; similar to Running)
   The job is preempted. This state applies only when LoadLeveler uses the suspend method to preempt the job.
Preempt Pending (EP; similar to Running)
   The job is in the process of being preempted. This state applies only when LoadLeveler uses the suspend method to preempt the job.
Rejected (X; similar to Idle)
   The job is rejected.
Reject Pending (XP; similar to Idle)
   The job did not start. Possible reasons why a job is rejected are: job requirements were not met on the target machine, or the user ID of the person running the job is not valid on the target machine. After a job leaves the Reject Pending state, it is moved into one of the following states: Idle, User Hold, or Removed.
Removed (RM; similar to Terminating)
   The job was stopped by LoadLeveler.
Remove Pending (RP; similar to Terminating)
   The job is in the process of being removed, but not all associated machines have acknowledged the removal of the job.
Resume Pending (MP; similar to Running)
   The job is in the process of being resumed.
Running (R; similar to Running)
   The job is running: the job was dispatched and has started on the designated machine.
Starting (ST; similar to Running)
   The job is starting: the job was dispatched, was received by the target machine, and LoadLeveler is setting up the environment in which to run the job. For a parallel job, LoadLeveler sets up the environment on all required nodes. See the description of the Pending state for more information on when the negotiator or the Schedd daemon moves a job into the Starting state.
System Hold (S; similar to Idle)
   The job has been put in system hold.
Table 6. Job state descriptions and abbreviations (continued)

Terminated (TX; similar to Terminating)
   If the negotiator and Schedd daemons experience communication problems, they may be temporarily unable to exchange information concerning the status of jobs in the system. During this period of time, some of the jobs may actually complete and therefore be removed from the Schedd's list of active jobs. When communication resumes between the two daemons, the negotiator will move such jobs to the Terminated state, where they will remain for a set period of time (specified by the NEGOTIATOR_REMOVE_COMPLETED keyword in the configuration file). When this time has passed, the negotiator will remove the jobs from its active list.
User & System Hold (HS; similar to Idle)
   The job has been put in both system hold and user hold.
User Hold (H; similar to Idle)
   The job has been put in user hold.
Vacated (V; similar to Idle)
   The job started but did not complete. The negotiator will reschedule the job (provided the job is allowed to be rescheduled). Possible reasons why a job moves to the Vacated state are: the machine where the job was running was flushed, the VACATE expression in the configuration file evaluated to True, or LoadLeveler detected a condition indicating the job needed to be vacated. For more information on the VACATE expression, see "Managing job status through control expressions" on page 68.
Vacate Pending (VP; similar to Idle)
   The job is in the process of being vacated.

Consumable resources

Consumable resources are assets available on machines in your LoadLeveler cluster. These assets are called "resources" because they model the commodities or services available on machines (including CPUs, real memory, virtual memory, large page memory, software licenses, and disk space). They are considered "consumable" because job steps use specified amounts of these commodities when the step is running. Once the step finishes, the resource becomes available for another job step.

Consumable resources which model the characteristics of a specific machine (such as the number of CPUs or the number of specific software licenses available only on that machine) are called machine resources. Consumable resources which model resources that are available across the LoadLeveler cluster (such as floating software licenses) are called floating resources.
For example, consider a configuration with 10 licenses for a given program (which can be used on any machine in the cluster). If these licenses are defined as floating resources, all 10 can be used on one machine, or they can be spread across as many as 10 different machines.

The LoadLeveler administrator can specify:
v Consumable resources to be considered by LoadLeveler's scheduling algorithms
v Quantity of resources available on specific machines
v Quantity of floating resources available on machines in the cluster
v Consumable resources to be considered in determining the priority of executing machines
v Default amount of resources consumed by a job step of a specified job class
v Whether CPU, real memory, virtual memory, or large page resources should be enforced using AIX Workload Manager (WLM)
v Whether all jobs submitted need to specify resources
A sketch of these settings appears after the notes below.

Users submitting jobs can specify the resources consumed by each task of a job step, or the resources consumed by the job on each machine where it runs, regardless of the number of tasks assigned to that machine.

If affinity scheduling support is enabled, the CPUs requested in the consumable resources requirement of a job will be used by the scheduler to determine the number of CPUs to be allocated and attached to that job's tasks running on machines enabled for affinity scheduling. However, if the affinity scheduling request contains the processor-core affinity option, the number of CPUs will be determined from the value specified by the task_affinity keyword instead of the CPUs value in the consumable resources requirement. For more information on scheduling affinity, see "LoadLeveler scheduling affinity support" on page 146.

Note:
1. When software licenses are used as a consumable resource, LoadLeveler does not attempt to obtain software licenses or to verify that software licenses have been obtained. However, by providing a user exit that can be invoked as a submit filter, the LoadLeveler administrator can provide code to first obtain the required license and then allow the job step to run. For more information on filtering job scripts, see "Filtering a job script" on page 76.
2. LoadLeveler scheduling algorithms use the availability of requested consumable resources to determine the machine or machines on which a job will run. Consumable resources (except for CPU, real memory, virtual memory, and large page) are only used for scheduling purposes and are not enforced. Instead, LoadLeveler's negotiator daemon keeps track of the consumable resources available by reducing them by the amount requested when a job step is scheduled, and increasing them when a consuming job step completes.
3. If a job is preempted, the job continues to use all consumable resources except for ConsumableCpus and ConsumableMemory (real memory), which are made available to other jobs.
4. When the network adapters on a machine support RDMA, the machine is automatically given a consumable resource called RDMA with an available quantity defined by the limit on the number of concurrent jobs that use RDMA. For machines with the "Switch Network Interface for HPS" network adapters, this limit is 4. Machines with InfiniBand adapters are given unlimited RDMA resources.
5. When steps require RDMA, either because they request bulkxfer or because they request rcxtblocks on at least one network statement, the job is automatically given a resource requirement for 1 RDMA.
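As referenced above, the following minimal sketch shows one way these settings might be expressed. The resource names, quantities, and the spice2g6 license name are illustrative; an actual cluster's files will differ:

   Configuration file (illustrative):
      SCHEDULE_BY_RESOURCES   = ConsumableCpus ConsumableMemory spice2g6
      FLOATING_RESOURCES      = spice2g6(10)
      ENFORCE_RESOURCE_USAGE  = ConsumableCpus ConsumableMemory
      ENFORCE_RESOURCE_POLICY = shares

   Administration file machine stanza (illustrative):
      node01: type = machine
              resources = ConsumableCpus(8) ConsumableMemory(16 gb)

Here the 10 floating spice2g6 licenses are tracked cluster-wide, while the machine stanza declares the CPUs and memory that node01 makes available to job steps.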
Consumable resources and AIX Workload Manager

If the administrator has indicated that resources should be enforced, LoadLeveler uses AIX Workload Manager (WLM) to give greater control over CPU, real memory, virtual memory, and large page resource allocation. WLM monitors system resources and regulates their allocation to processes running on AIX. These actions prevent jobs from interfering with each other when they have conflicting resource requirements. WLM achieves this control by creating different classes of service and allowing attributes to be specified for those classes.

LoadLeveler dynamically generates WLM classes with specific resource entitlements. A single WLM class is created for each job step, and the process ID of that job step is assigned to that class. This is done for each node that a job step is assigned to run on. LoadLeveler then defines resource shares or limits for that class depending on the LoadLeveler enforcement policy defined. These resource shares or limits represent the job's requested resource usage in relation to the amount of resources available on the machine.

When LoadLeveler defines multiple memory resources under one WLM class, AIX WLM uses the following order to determine if resource limits have been exceeded:
1. Real Memory Absolute Limit
2. Virtual Memory Absolute Limit
3. Large Page Limit
4. Real Memory shares or percent limit

Note: When real memory or CPU with either shares or percent limits are exceeded, the job processes in that class receive a lower scheduling priority until their utilization drops below the hard max limit. When virtual memory or absolute real memory limits are exceeded, the job processes are killed. When the large page limit is exceeded, any new large page requests are denied.

When the enforcement policy is shares, LoadLeveler assigns a share value to the class based on the resources requested for the job step (one unit of resource equals one share). When the job step process is running, AIX WLM dynamically calculates an appropriate resource entitlement based on the WLM class share value of the job step and the total number of shares requested by all active WLM classes. It is important to note that AIX WLM will only enforce these target percentages when the resource is under contention.

When the enforcement policy is limits (soft or hard), LoadLeveler assigns a percentage value to the class based on the resources requested for the job step and the total machine resources. This resource percentage is enforced regardless of any other active WLM classes. A soft limit indicates the maximum amount of the resource that can be made available when there is contention for the resources. This maximum can be exceeded if no one else requires the resource. A hard limit indicates the maximum amount of the resource that can be made available even if there is no contention for the resources.
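To make the shares arithmetic concrete, consider a hypothetical node running two enforced job steps: step A requests ConsumableCpus(2) and step B requests ConsumableCpus(6). Under the shares policy, A's WLM class receives 2 shares and B's receives 6, so while CPU is under contention WLM targets roughly 25% of the CPU for A and 75% for B; when there is no contention, either step may use more. Under a hard limit policy on an 8-CPU machine, the same requests would instead cap the classes at 25% and 75% regardless of contention.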
Note: A WLM class is active for the duration of a job step and is deleted when the job step completes. There is a limit of 64 active WLM classes per machine. Therefore, when resources are being enforced, only 64 job steps can be running on one machine.

For additional information about integrating LoadLeveler with AIX Workload Manager, see "Steps for integrating LoadLeveler with the AIX Workload Manager" on page 137.

Overview of reservations

Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations, which specify a time period during which specific node resources are reserved for exclusive use by particular users or groups. This capability is known in the computing industry as advance reservation.

Normally, jobs wait to be dispatched until the resources they require become available. Through the use of reservations, wait time can be reduced because the jobs have exclusive use of the node resources (CPUs, memory, disk drives, communication adapters, and so on) as soon as the reservation period begins.

Note: Advance reservation supports Blue Gene resources, including the Blue Gene compute nodes. For more information, see "Blue Gene reservation support" on page 159.

In addition to reducing wait time, reservations also are useful for:
v Running a workload that needs to start or finish at a particular time. The job steps must be associated with, or bound to, the reservation before LoadLeveler can run them during the reservation period.
v Reserving resources for a workload that repeats at regular intervals. You can make a single request to create a recurring reservation, which reserves a specific set of resources for a specific time slot that repeats on a regular basis for a defined interval.
v Setting aside a set of nodes for maintenance purposes. In this case, job steps are not bound to the reservation.
Only bound job steps may run on the reserved nodes, which means that a bound job step competes for reserved resources only with other job steps that are bound to the same reservation.

The following sequence of events describes, in general terms, how you can set up and use reservations in the LoadLeveler environment. It also describes how LoadLeveler manages activities related to the use of reservations.

1. Configuring LoadLeveler to support reservations

An administrator uses specific keywords in the configuration and administration files to define general reservation policies. These keywords include:
v max_reservations, which, when used in the global configuration file, defines the maximum number of reservations for the entire cluster.
v max_reservations, which, when used in a user or group stanza of the administration file, can also be used to define both:
   – The users or groups that will be allowed to create reservations. To be authorized to create reservations, LoadLeveler administrators also must have the max_reservations keyword set in their own user or group stanzas.
   – How many reservations users may own.
   Note: With recurring advance reservations, to avoid confusion about what counts as one reservation, LoadLeveler uses the approach that one reservation counts as one instance regardless of the number of times the reservation recurs before it expires. This applies to the system-wide max_reservations configuration setting as well as the same type of configuration settings at the user and group levels.
v max_reservation_duration, which defines the maximum duration for reservations.
v reservation_permitted, which defines the nodes that may be used for reservations.
v max_reservation_expiration, which defines how long recurring reservations are permitted to last (expressed as the number of days).
Administrators also may configure LoadLeveler to collect accounting data about reservations when the reservations complete or are canceled.

2. Creating reservations

After LoadLeveler is configured for reservations, an administrator or authorized user may create specific reservations, defining reservation attributes that include:
v The start time and the duration of the reservation. The start and end times for a reservation are based on the time-of-day (TOD) clock on the central manager machine.
v Whether or not the reservation recurs and, if it recurs, the interval in which it does so.
v The nodes to be reserved. Until the reservation period actually begins, the selected nodes are available to run any jobs; when the reservation starts, only jobs bound to the reservation may run on the reserved nodes.
v The users or groups that may use the reservation.

LoadLeveler assigns a unique ID to the reservation, and returns that ID to the owner. After the reservation is successfully created:
v Reservation owners may:
   – Modify, query, and cancel their reservations.
   – Allow other LoadLeveler users or groups to submit jobs to run during a reservation period.
   – Submit jobs to run during a reservation period.
v Users or groups that are allowed to use the reservation also may query reservations, and submit jobs to run during a reservation period.
To run jobs during a reservation period, users must bind job steps to the reservation. You may bind both batch and interactive POE job steps to a reservation.

3. Preparing for the start of a reservation

During the preparation time for a reservation, LoadLeveler:
v Preempts any jobs that are still running on the reserved nodes.
v Checks the condition of reserved nodes, and notifies the reservation owner and LoadLeveler administrators by e-mail of any situations that might require the reservation owner or an administrator to take corrective action. Such conditions include:
   – Reserved nodes that are down, suspended, no longer in the LoadLeveler cluster, or otherwise unavailable for use.
   – Non-preemptable job steps that cannot finish running before the reservation start time.
During this time, reservation owners may modify, cancel, and add users or groups to their reservations. Owners and users or groups that are allowed to use the reservation may query the reservation or bind job steps to it.

4. Starting the reservation

When the reservation period begins, LoadLeveler dispatches job steps that are bound to the reservation. After the reservation period begins, reservation owners may modify, cancel, and add users or groups to their reservations. Owners and users or groups that are allowed to use the reservation may query the reservation or bind job steps to it. During the reservation period, LoadLeveler ignores system preemption rules for bound job steps; however, LoadLeveler administrators may use the llpreempt command to manually preempt bound job steps.

When the reservation ends or is canceled:
v LoadLeveler unbinds all job steps from the reservation if there are no further occurrences remaining. At this point the unbound job steps compete with all other LoadLeveler jobs for available resources. If there are occurrences remaining in the reservation, job steps are automatically bound to the next occurrence.
v If accounting data is being collected for the reservation, LoadLeveler also updates the reservation history file.

For more detailed information and instructions for setting up and using reservations, see:
v "Configuring LoadLeveler to support reservations" on page 131.
v "Working with reservations" on page 213.

Fair share scheduling overview

Fair share scheduling in LoadLeveler provides a way to divide resources in a LoadLeveler cluster among users or groups of users.

Historic resource usage data that is collected at the time the job ends can be used to influence job priorities to achieve the resource usage proportions allocated to users or groups of users in the LoadLeveler configuration files. The resource usage data will decay over time so that the relatively recent historic resource usage will have the most influence on job priorities. The CPU resources in the cluster and the Blue Gene resources are currently supported by fair share scheduling.

For information about configuring fair share scheduling in LoadLeveler, see "Using fair share scheduling" on page 160.
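As a rough illustration of how such proportions are expressed, the following sketch uses the fair share keywords; the share counts and user names are illustrative only (see "Using fair share scheduling" on page 160 for the authoritative description):

   Configuration file (illustrative):
      FAIR_SHARE_TOTAL_SHARES = 100   # total shares representing the cluster's resources
      FAIR_SHARE_INTERVAL     = 180   # hours over which historic usage decays

   Administration file user stanzas (illustrative):
      carol: type = user
             fair_shares = 60         # carol's jobs are entitled to 60 of the 100 shares
      dave:  type = user
             fair_shares = 40

With these proportions, recent CPU usage charged to carol or dave beyond their allocated shares lowers the priority of their subsequent jobs until the usage data decays.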
Chapter 2. Getting a quick start using the default configuration

If you are very familiar with UNIX and Linux system administration and job scheduling, follow these steps to get LoadLeveler up and running on your network quickly in a default configuration. This default configuration will merely enable you to submit serial jobs; for a more complex setup, see Chapter 4, "Configuring the LoadLeveler environment," on page 41.

What you need to know before you begin

LoadLeveler sets up default values for configuration information.
v loadl is the recommended LoadLeveler user ID and the LoadLeveler group ID. LoadLeveler daemons run under this user ID to perform file I/O, and many LoadLeveler files are owned by this user ID.
v The home directory of loadl is the configuration directory.
v LoadL_config is the name of the configuration file.
For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Using the default configuration files

Follow these steps to use the default configuration files. A combined example appears at the end of this topic.

Note: You can find samples of the LoadL_admin and LoadL_config files in the release directory (in the samples subdirectory).

1. Ensure that the installation procedure has completed successfully and that the configuration file, LoadL_config, exists in LoadLeveler's home directory or in the directory specified by the LoadLConfig keyword.
2. Identify yourself as the LoadLeveler administrator in the LoadL_config file using the LOADL_ADMIN keyword. The syntax of this keyword is:
      LOADL_ADMIN = list_of_user_names (required)
   where list_of_user_names is a blank-delimited list of those individuals who will have administrative authority. Refer to "Defining LoadLeveler administrators" on page 43 for more information.
3. Define a machine to act as the LoadLeveler central manager by coding one machine stanza as follows in the administration file, which is called LoadL_admin. (Replace machine_name with the actual name of the machine.)
      machine_name: type = machine
                    central_manager = true
   Do not specify more than one machine as the central manager. Also, if during installation you ran llinit with the -cm flag, the central manager is already defined in the LoadL_admin file because the llinit command takes parameters that you entered and updates the administration and configuration files. See "Defining machines" on page 84 for more information.
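Putting steps 2 and 3 together, a minimal pair of files might look like the following sketch; the administrator and host names are illustrative:

   LoadL_config (fragment):
      LOADL_ADMIN = loadl

   LoadL_admin (fragment):
      headnode.example.com: type = machine
                            central_manager = true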
LoadLeveler for Linux quick start

If you would like to quickly install and configure LoadLeveler for Linux and submit a serial job on a single node, use these procedures.

Note: This setup is for a single node only, and the node used for this example is c197blade1b05.ppd.pok.ibm.com.

Quick installation

Details of this installation apply to RHEL 4 System x servers.

Note: This installation method is, however, applicable to all other systems. You must install the corresponding license RPM for the system you are installing on.

This installation assumes that the LoadLeveler RPMs are located at /mnt/cdrom/.
1. Log on to node c197blade1b05.ppd.pok.ibm.com as root, which is the node you are installing on.
2. Add a UNIX group for LoadLeveler users (make sure the group ID is correct) by entering the following command:
      groupadd -g 1000 loadl
3. Add a UNIX user for LoadLeveler (make sure the user ID is correct) by entering the following command:
      useradd -c "LoadLeveler User" -d /home/loadl -s /bin/bash -u 1001 -g 1000 -m loadl
4. Install the license RPM by entering the following command:
      rpm -ivh /mnt/cdrom/LoadL-full-license-RH4-X86-3.5.0.0-0.i386.rpm
5. Change to the LoadLeveler installation path by entering the following command:
      cd /opt/ibmll/LoadL/sbin
6. Run the LoadLeveler installation script by entering:
      ./install_ll -y -d /mnt/cdrom
7. Install the required LoadLeveler 3.5.0.1 service updates for this RPM. Updates and installation instructions are available at:
      https://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/intel.html

Quick configuration

Use this method to perform a quick configuration.
1. Change the login to the newly created LoadLeveler user by entering the following command:
      su - loadl
2. Add the LoadLeveler bin directory to the search path:
      export PATH=$PATH:/opt/ibmll/LoadL/full/bin
3. Run the LoadLeveler initialization script:
      /opt/ibmll/LoadL/full/bin/llinit -local /tmp/loadl -release /opt/ibmll/LoadL/full -cm c197blade1b05.ppd.pok.ibm.com

Quick verification

Use this method to perform a quick verification.
1. Start LoadLeveler by entering the following command:
      llctl start
   You should receive a response similar to the following:
      llctl: Attempting to start LoadLeveler on host c197blade1b05.ppd.pok.ibm.com
      LoadL_master 3.5.0.1 rsats001a 2008/10/29 RHEL 4.0 140
      CentralManager = c197blade1b05.ppd.pok.ibm.com
      [loadl@c197blade1b05 bin]$
2. Check LoadLeveler status by entering the following command:
      llstatus
   You should receive a response similar to the following:
      Name                      Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
      c197blade1b05.ppd.pok.ibm Avail    0   0 Idle     0  0.00    1 i386 Linux2
      i386/Linux2               1 machines  0 jobs  0 running task
      Total Machines            1 machines  0 jobs  0 running task
      The central manager is defined on c197blade1b05.ppd.pok.ibm.com
      The BACKFILL scheduler is in use
      All machines on the machine_list are present.
      [loadl@c197blade1b05 bin]$
3. Submit a sample job by entering the following command:
      llsubmit /opt/ibmll/LoadL/full/samples/job1.cmd
   You should receive a response similar to the following:
      llsubmit: The job "c197blade1b05.ppd.pok.ibm.com.1" with 2 job steps has been submitted.
      [loadl@c197blade1b05 samples]$
4. Display the LoadLeveler job queue by entering the following command:
      llq
   You should receive a response similar to the following:
      Id                Owner  Submitted   ST PRI Class    Running On
      ----------------- ------ ----------- -- --- -------- -------------
      c197blade1b05.1.0 loadl  8/15 17:25  R  50  No_Class c197blade1b05
      c197blade1b05.1.1 loadl  8/15 17:25  I  50  No_Class
      2 job step(s) in queue, 1 waiting, 0 pending, 1 running, 0 held, 0 preempted
      [loadl@c197blade1b05 samples]$
5. Check the output files in the home directory (/home/loadl) by entering the following command:
      ls -ltr job*
   You should receive a response similar to the following:
      -rw-rw-r-- 1 loadl loadl 1940 Aug 15 17:26 job1.c197blade1b05.1.0.out
      -rw-rw-rw- 1 loadl loadl 1940 Aug 15 17:27 job1.c197blade1b05.1.1.out
      [loadl@c197blade1b05 ~]$

Post-installation considerations

This information explains how to start (or restart) and stop LoadLeveler. It also tells you where files are located after you install LoadLeveler, and it points you to troubleshooting information.

Starting LoadLeveler

You can start LoadLeveler using any LoadLeveler administrator user ID as defined in the configuration file. To start all of the machines that are defined in machine stanzas in the administration file, enter:
      llctl -g start
Post-installation considerations

This information explains how to start (or restart) and stop LoadLeveler. It
also tells you where files are located after you install LoadLeveler, and it
points you to troubleshooting information.

Starting LoadLeveler

You can start LoadLeveler using any LoadLeveler administrator user ID as
defined in the configuration file. To start all of the machines that are
defined in machine stanzas in the administration file, enter:

   llctl -g start

The central manager machine is started first, followed by the other machines
in the order listed in the administration file. See "llctl - Control
LoadLeveler daemons" on page 439 for more information.

By default, llctl uses rsh to start LoadLeveler on the target machine. Other
mechanisms, such as ssh, can be used by setting the LL_RSH_COMMAND
configuration keyword in LoadL_config. However you choose to start LoadLeveler
on remote hosts, you must have the authority to run commands remotely on those
hosts.
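For example, to have llctl start remote machines with ssh instead of rsh, you
might add a line like the following to LoadL_config; the path shown is an
assumption for the example and should match the location of ssh on your
systems:

   # Use ssh rather than the default rsh to start daemons on remote hosts
   LL_RSH_COMMAND = /usr/bin/ssh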
You can verify that the machine has been properly configured by running the
sample jobs in the appropriate samples directory (job1.cmd, job2.cmd, and
job3.cmd). You must read the job2.cmd and job3.cmd files before submitting
them: job2 must be edited, and a C program must be compiled to use job3. It is
a good idea to copy the sample jobs to another directory before modifying
them; you must have read/write permission to the directory in which they are
located. You can use the llsubmit command to submit the sample jobs from
several different machines and verify that they complete (see "llsubmit -
Submit a job" on page 531).

If you are running AFS and some jobs do not complete, you might need to use
the AFS fs command (fs listacl) to ensure that you have write permission to
the spool, execute, and log directories.

If you are running with cluster security services enabled and some jobs do not
complete, ensure that you have write permission to the spool, execute, and log
directories. Also ensure that the user ID is authorized to run jobs on the
submitting machine (the identity of the user must exist in the .rhosts file of
the user on the machine on which the job is being run).

Note: LoadLeveler for Linux does not support cluster security services.

If you are running submit-only LoadLeveler, once the LoadLeveler pool is up
and running, you can use the llsubmit, llq, and llcancel commands from the
submit-only machines. For more information about these commands, see:
v "llsubmit - Submit a job" on page 531
v "llq - Query job status" on page 479
v "llcancel - Cancel a submitted job" on page 421

You can also invoke the LoadLeveler graphical user interface xloadl_so from
the submit-only machines (see Chapter 15, "Graphical user interface (GUI)
reference," on page 403).

Location of directories following installation

After installation, the product directories that reside on disk are shown in
Table 7. The installation process creates only those directories required to
service the LoadLeveler options specified during the installation. For AIX,
release_directory indicates /usr/lpp/LoadL/full; for Linux, it indicates
/opt/ibmll/LoadL/full.

Table 7. Location and description of product directories following installation

   Directory                   Description
   release_directory/bin       Part of the release directory containing
                               daemons, commands, and other binaries
   release_directory/lib       Part of the release directory containing
                               product libraries and resource files
   release_directory/man       Part of the release directory containing man
                               pages
   release_directory/samples   Part of the release directory containing sample
                               administration and configuration files and
                               sample jobs
   release_directory/include   Part of the release directory containing header
                               files for the application programming
                               interfaces
   Local directory             spool, execute, and log directories for each
                               machine in the cluster
   Home directory              Administration and configuration files, and
                               symbolic links to the release directory
   /usr/lpp/LoadL/codebase     Configuration tasks for AIX

Table 8 shows the location of directories for submit-only LoadLeveler:

Table 8. Location and description of directories for submit-only LoadLeveler

   Directory                      Description
   release_directory/so/bin       Part of the release directory containing
                                  commands
   release_directory/so/man       Part of the release directory containing man
                                  pages
   release_directory/so/samples   Part of the release directory containing
                                  sample administration and configuration files
   release_directory/so/lib       Contains libraries and graphical user
                                  interface resource files
   Home directory                 Contains administration and configuration
                                  files

If you have a mixed LoadLeveler cluster of AIX and Linux machines, you might
want to make the following symbolic links:
v On AIX, as root, enter:
   mkdir -p /opt/ibmll
   ln -s /usr/lpp/LoadL /opt/ibmll/LoadL
v On Linux, as root, enter:
   mkdir -p /usr/lpp
   ln -s /opt/ibmll/LoadL /usr/lpp/LoadL

With the addition of these symbolic links, a user application can use either
/usr/lpp/LoadL or /opt/ibmll/LoadL to refer to the location of LoadLeveler
files, regardless of whether the application is running on AIX or Linux.

If LoadLeveler will not start following installation, see "Why won't
LoadLeveler start?" on page 700 for troubleshooting information.
Chapter 3. What operating systems are supported by LoadLeveler?

LoadLeveler supports three operating environments:
v AIX 6.1 and AIX 5.3
  IBM's AIX 6.1 and AIX 5.3 are open UNIX operating environments that conform
  to The Open Group UNIX 98 Base Brand industry standard. AIX 6.1 and AIX 5.3
  provide high levels of integration, flexibility, and reliability and operate
  on IBM Power Systems and IBM Cluster 1600 servers and workstations.
  AIX 6.1 and AIX 5.3 support the concurrent operation of 32- and 64-bit
  applications, with key internet technologies such as Java™ and the XML
  parser for Java included as part of the base operating system.
  A strong affinity between AIX and Linux permits popular applications
  developed on Linux to run on AIX 6.1 and AIX 5.3 with a simple
  recompilation.
v Linux
  LoadLeveler supports the following distributions of Linux:
  - Red Hat® Enterprise Linux (RHEL) 4 and RHEL 5
  - SUSE Linux Enterprise Server (SLES) 9 and SLES 10
v IBM System Blue Gene Solution
  While no LoadLeveler processes actually run on the Blue Gene machine,
  LoadLeveler can interact with the Blue Gene machine and supports the
  scheduling of jobs to the machine.
  Note: For models of the Blue Gene system such as Blue Gene/S, which can only
  run a single job at a time, LoadLeveler does not have to be configured to
  schedule resources for Blue Gene jobs. For such systems, serial jobs can be
  used to submit work to the front end node for the Blue Gene system.

LoadLeveler for AIX and LoadLeveler for Linux compatibility

LoadLeveler for Linux is compatible with LoadLeveler for AIX. Its command line
interfaces, graphical user interfaces, and application programming interfaces
(APIs) are the same as they have been for AIX. The formats of the job command
file, configuration file, and administration file also remain the same.

System administrators can set up and maintain a LoadLeveler cluster consisting
of some machines running LoadLeveler for AIX and some machines running
LoadLeveler for Linux. This is called a mixed cluster. In a mixed cluster,
jobs can be submitted from either AIX or Linux machines. Jobs submitted to a
Linux job queue can be dispatched to an AIX machine for execution, and jobs
submitted to an AIX job queue can be dispatched to a Linux machine for
execution.

Although the LoadLeveler products for AIX and Linux are compatible, they do
have some differences in the level of support for specific features. For
further details, see the following topics:
v "Restrictions for LoadLeveler for Linux" on page 36
v "Features not supported in LoadLeveler for Linux" on page 36
v "Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed
  clusters" on page 37
Restrictions for LoadLeveler for Linux

LoadLeveler for Linux supports a subset of the features that are available in
the LoadLeveler for AIX product. The following features are available, but are
subject to restrictions:
v 32-bit applications using the LoadLeveler APIs
  LoadLeveler for Linux supports only the 32-bit LoadLeveler API library
  (libllapi.so) on the following platforms:
  - RHEL 4 and RHEL 5 on IBM IA-32 xSeries® servers
  - SLES 9 and SLES 10 on IBM IA-32 xSeries servers
  Applications linked to the LoadLeveler APIs on these platforms must be
  32-bit applications.
v 64-bit applications using the LoadLeveler APIs
  LoadLeveler for Linux supports only the 64-bit LoadLeveler API library
  (libllapi.so) on the following platforms:
  - RHEL 4 and RHEL 5 on IBM xSeries servers with AMD Opteron or Intel EM64T
    processors
  - RHEL 4 and RHEL 5 on POWER™ servers
  - SLES 9 and SLES 10 on IBM xSeries servers with AMD Opteron or Intel EM64T
    processors
  - SLES 9 and SLES 10 on POWER servers
  Applications linked to the LoadLeveler APIs on these platforms must be
  64-bit applications.
v Support for AFS file systems
  LoadLeveler for Linux support for authenticated access to AFS file systems
  is limited to RHEL 4 on xSeries servers and IBM xSeries servers with AMD
  Opteron or Intel EM64T processors. It is not available on systems running
  SLES 9 or SLES 10.

Features not supported in LoadLeveler for Linux

LoadLeveler for Linux supports a subset of the features that are available in
the LoadLeveler for AIX product. The following features are not supported:
v RDMA consumable resource
  On systems with High Performance Switch adapters, RDMA consumable resources
  are not supported on LoadLeveler for Linux.
v User context RDMA blocks
  User context RDMA blocks are not supported by LoadLeveler for Linux.
v Checkpoint/restart
  LoadLeveler for AIX uses a number of features that are specific to the AIX
  kernel to provide support for checkpoint/restart of user applications
  running under LoadLeveler. Checkpoint/restart is not available in this
  release of LoadLeveler for Linux.
v AIX Workload Manager (WLM)
  WLM can strictly control the use of system resources. LoadLeveler for AIX
  uses WLM to enforce the use of a number of consumable resources defined by
  LoadLeveler (such as ConsumableCpus, ConsumableVirtualMemory,
  ConsumableLargePageMemory, and ConsumableMemory). This enforcement of
  consumable resource usage through WLM is not available in this release of
  LoadLeveler for Linux.
v CtSec security
  LoadLeveler for AIX can exploit CtSec (Cluster Security Services) security
  functions. These functions authenticate the identity of users and programs
  interacting with LoadLeveler. These features are not available in this
  release of LoadLeveler for Linux.
v LoadL_GSmonitor daemon
  The LoadL_GSmonitor daemon in the LoadLeveler for AIX product uses the Group
  Services Application Programming Interface (GSAPI) to monitor machine
  availability and notify the LoadLeveler central manager when a machine is no
  longer reachable. This daemon is not available in the LoadLeveler for Linux
  product.
v Task guide tool
v System error log
  Each LoadLeveler daemon has its own log file where information relevant to
  its operation is recorded. In addition to this feature, which exists on all
  platforms, LoadLeveler for AIX also uses the errlog function to record
  critical LoadLeveler events into the AIX system log. Support for an
  equivalent Linux function is not available in this release.

Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters

Several restrictions apply when operating a LoadLeveler cluster that contains
AIX 6.1, AIX 5.3, and Linux machines:
v The central manager node must run a version of LoadLeveler equal to or
  higher than any LoadLeveler version being run on a node in the cluster.
v CtSec security features cannot be used.
v AIX jobs that use checkpointing must be sent to AIX nodes for execution.
  This can be done either by defining and specifying job checkpointing for job
  classes that exist only on AIX nodes or by coding appropriate requirements
  expressions. Checkpointing jobs that are sent to a Linux node will be
  rejected by the LoadL_startd daemon running on the Linux node.
v WLM is supported in a mixed cluster. However, enforcement of the use of
  consumable resources will occur through WLM on AIX nodes only.
Part 2. Configuring and managing the TWS LoadLeveler environment

After installing IBM Tivoli Workload Scheduler (TWS) LoadLeveler, you may
customize it by modifying both the configuration file and the administration
file (see Part 1, "Overview of TWS LoadLeveler concepts and operation," on
page 1 for overview information).

The configuration file contains many parameters that you can set or modify to
control how TWS LoadLeveler operates. The administration file optionally lists
and defines the machines in the TWS LoadLeveler cluster and the
characteristics of classes, users, and groups.

To manage TWS LoadLeveler easily, you should have one global configuration
file and only one administration file, both centrally located on a machine in
the TWS LoadLeveler cluster. Every other machine in the cluster must be able
to read the configuration and administration files that are located on the
central machine.

You may have multiple local configuration files that specify information
specific to individual machines. TWS LoadLeveler does not prevent you from
having multiple copies of administration files, but you need to be sure to
update all the copies whenever you make a change to one. Having only one
administration file prevents any confusion.
Chapter 4. Configuring the LoadLeveler environment

One of your main tasks as system administrator is to configure LoadLeveler. To
configure LoadLeveler, you need to know what the configuration information is
and where it is located. Configuration information includes the following:
v The LoadLeveler user ID and group ID
v The configuration directory
v The global configuration file

Configuring LoadLeveler involves modifying the configuration files that
specify the terms under which LoadLeveler can use machines. There are two
types of configuration files:
v Global configuration file: By default, this file is called LoadL_config and
  it contains configuration information common to all nodes in the LoadLeveler
  cluster.
v Local configuration file: This file is generally called LoadL_config.local
  (although it is possible for you to rename it). This file contains specific
  configuration information for an individual node. The LoadL_config.local
  file is in the same format as LoadL_config, and the information in this file
  overrides any information specified in LoadL_config. It is an optional file
  that you use to modify information on a local machine. Its full path name is
  specified in the LoadL_config file by using the LOCAL_CONFIG keyword, as
  shown in the sketch that follows this list. See "Specifying file and
  directory locations" on page 47 for more information.
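For example, the global file can point each machine at its own local file with
a line like the following sketch. Keeping the local file in the LoadLeveler
home directory is an assumption for this example; any per-machine path that
every node can resolve works as well:

   # In LoadL_config: each machine reads its own local overrides
   LOCAL_CONFIG = $(tilde)/LoadL_config.local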
Table 9 identifies where you can find more information about using
configuration and administration files to modify the TWS LoadLeveler
environment.

Table 9. Roadmap of tasks for TWS LoadLeveler administrators

   To learn about:                           Read the following:
   Controlling how TWS LoadLeveler           Chapter 4, "Configuring the
   operates by customizing the global or     LoadLeveler environment"
   local configuration file
   Controlling TWS LoadLeveler resources     Chapter 5, "Defining LoadLeveler
   by customizing an administration file     resources to administer," on page 83
   Additional ways to modify TWS             Chapter 6, "Performing additional
   LoadLeveler that require customization    administrator tasks," on page 103
   of both the configuration and
   administration files
   Ways to control or monitor TWS            v Chapter 16, "Commands," on page 411
   LoadLeveler operations by using the       v Chapter 7, "Using LoadLeveler's GUI
   TWS LoadLeveler commands, GUI, and          to perform administrator tasks," on
   APIs                                        page 169
                                             v Chapter 17, "Application programming
                                               interfaces (APIs)," on page 541

You can run your installation with default values set by LoadLeveler, or you
can change any or all of them. Table 10 lists topics that discuss how you may
configure the LoadLeveler environment by modifying the configuration file.

Table 10. Roadmap of administrator tasks related to using or modifying the
LoadLeveler configuration file

   To learn about:                Read the following:
   Using the default              Chapter 2, "Getting a quick start using the
   configuration files shipped    default configuration," on page 29
   with LoadLeveler
   Modifying the global and       "Modifying a configuration file"
   local configuration files
   Defining major elements of     v "Defining LoadLeveler administrators" on
   the LoadLeveler                  page 43
   configuration                  v "Defining a LoadLeveler cluster" on page 44
                                  v "Defining LoadLeveler machine
                                    characteristics" on page 54
                                  v "Defining security mechanisms" on page 56
                                  v "Defining usage policies for consumable
                                    resources" on page 60
                                  v "Steps for configuring a LoadLeveler
                                    multicluster" on page 151
   Enabling optional              v "Enabling support for bulk data transfer
   LoadLeveler functions            and rCxt blocks" on page 61
                                  v "Gathering job accounting data" on page 61
                                  v "Managing job status through control
                                    expressions" on page 68
                                  v "Tracking job processes" on page 70
                                  v "Querying multiple LoadLeveler clusters" on
                                    page 71
   Modifying LoadLeveler          "Providing additional job-processing controls
   operations through             through installation exits" on page 72
   installation exits

Modifying a configuration file

The configuration files that come with LoadLeveler contain many parameters
that you can set. In most cases, you will only have to modify a few of them.
In some cases, though, depending upon the LoadLeveler nodes, network
connection, and hardware availability, you may need to modify additional
parameters.

All LoadLeveler commands, daemons, and processes read the administration and
configuration files at startup time. If you change the administration or
configuration files after LoadLeveler has already started, any LoadLeveler
command or process, such as the LoadL_starter process, will read the newer
version of the files, while the running daemons will continue to use the data
from the older version. To ensure that all LoadLeveler commands, daemons, and
processes use the same configuration data, run the reconfiguration command on
all machines in the cluster each time the administration or configuration
files are changed.

To override the defaults, update the following keywords in the /etc/LoadL.cfg
file:

LoadLUserid
   Specifies the LoadLeveler user ID.
LoadLGroupid
   Specifies the LoadLeveler group ID.

LoadLConfig
   Specifies the full path name of the configuration file.

Note that if you change the LoadLeveler user ID to something other than loadl,
you will have to make sure your configuration files are owned by this ID.

If Cluster Security (CtSec) services are enabled, make sure you update the
unix.map file if the LoadLUserid is specified as something other than loadl.
Refer to "Steps for enabling CtSec services" on page 58 for more details.

You can also override the /etc/LoadL.cfg file. For an example of when you
might want to do this, see "Querying multiple LoadLeveler clusters" on page 71.

Before you modify a configuration file, you need to:
v Ensure that the installation procedure has completed successfully and that
  the configuration file, LoadL_config, exists in LoadLeveler's home directory
  or in the directory specified in /etc/LoadL.cfg. For additional details
  about installation, see TWS LoadLeveler: Installation Guide.
v Know how to correctly specify keywords in the configuration file. For
  information about configuration file keyword syntax and other details, see
  Chapter 12, "Configuration file reference," on page 263.
v Identify yourself as the LoadLeveler administrator using the LOADL_ADMIN
  keyword.

After you finish modifying the configuration file, notify the LoadLeveler
daemons by issuing the llctl command with either the reconfig or recycle
keyword. Otherwise, LoadLeveler will not process the modifications you made to
the configuration file.

Defining LoadLeveler administrators

Specify the LOADL_ADMIN keyword with a list of user names of those individuals
who will have administrative authority. These users are able to invoke the
administrator-only commands such as llctl, llfavorjob, and llfavoruser. These
administrators can also invoke the administrator-only GUI functions. For more
information, see Chapter 7, "Using LoadLeveler's GUI to perform administrator
tasks," on page 169.

LoadLeveler administrators on this list also receive mail describing problems
that are encountered by the master daemon. When CtSec is enabled, the
LOADL_ADMIN list is used only as a mailing list. For more information, see
"Defining security mechanisms" on page 56.

Administrative authority on a machine grants privileges only on that machine;
it does not grant administrative privileges on other machines. To be an
administrator on all machines in the LoadLeveler cluster, either specify your
user ID in the global configuration file with no entries in the local
configuration file, or specify your user ID in every local configuration file
that exists in the LoadLeveler cluster.

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.
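As an illustration, the following sketch shows an /etc/LoadL.cfg that
overrides the defaults, together with a LOADL_ADMIN list in the global
configuration file. The user names, group ID values, and path are assumptions
for the example:

   # /etc/LoadL.cfg: override the default user, group, and configuration path
   LoadLUserid  = loadl
   LoadLGroupid = loadl
   LoadLConfig  = /home/loadl/LoadL_config

   # In LoadL_config: users with administrative authority
   LOADL_ADMIN = loadl brad marsha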
Defining a LoadLeveler cluster

It will be necessary to define the characteristics of the LoadLeveler cluster.
Table 11 lists the topics that discuss how you can do so.

Table 11. Roadmap for defining LoadLeveler cluster characteristics

   To learn about:                Read the following:
   Defining characteristics of   v "Choosing a scheduler"
   specific LoadLeveler          v "Setting negotiator characteristics and
   daemons                         policies" on page 45
                                  v "Specifying alternate central managers" on
                                    page 46
   Defining other cluster        v "Defining network characteristics" on page 47
   characteristics               v "Specifying file and directory locations" on
                                    page 47
                                  v "Configuring recording activity and log
                                    files" on page 48
                                  v "Setting up file system monitoring" on
                                    page 54
   Correctly specifying           Chapter 12, "Configuration file reference," on
   configuration file keywords    page 263
   Working with daemons and      v "llctl - Control LoadLeveler daemons" on
   machines in a LoadLeveler       page 439
   cluster                       v "llinit - Initialize machines in the
                                    LoadLeveler cluster" on page 457

Choosing a scheduler

This topic discusses the types of schedulers available, which you may specify
using the configuration file keyword SCHEDULER_TYPE. For information about the
configuration file keyword syntax and other details, see Chapter 12,
"Configuration file reference," on page 263.

LL_DEFAULT
   This scheduler runs serial jobs. It uses CPU time efficiently by scheduling
   jobs on what otherwise would be idle nodes (and workstations). It does not
   require that users set a wall clock limit. Also, this scheduler starts,
   suspends, and resumes jobs based on workload.

BACKFILL
   This scheduler runs both serial and parallel jobs. The objective of
   BACKFILL scheduling is to maximize the use of resources to achieve the
   highest system efficiency, while preventing potentially excessive delays in
   starting jobs with large resource requirements. These large jobs can run
   because the BACKFILL scheduler does not allow jobs with smaller resource
   requirements to continuously use up resources before the larger jobs can
   accumulate enough resources to run.
   The BACKFILL scheduler supports:
   v The scheduling of multiple tasks per node
   v The scheduling of multiple user space tasks per adapter
   v The preemption of jobs
   v The use of reservations
   v The scheduling of inbound and outbound data staging tasks
   v Scale-across scheduling that allows you to take advantage of
     underutilized resources in a multicluster installation
   These functions are not supported by the default LoadLeveler scheduler. For
   more information about the BACKFILL scheduler, see "Using the BACKFILL
   scheduler" on page 110.

API
   This keyword option allows you to enable an external scheduler, such as the
   Extensible Argonne Scheduling sYstem (EASY). The API option is intended for
   installations that want to create a scheduling algorithm for parallel jobs
   based on site-specific requirements. For more information about external
   schedulers, see "Using an external scheduler" on page 115.
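For example, to select the BACKFILL scheduler you would set a single keyword
in the global configuration file, as in this sketch (remember to run llctl
reconfig afterward so the change takes effect):

   # Choose the scheduler type: LL_DEFAULT, BACKFILL, or API
   SCHEDULER_TYPE = BACKFILL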
Setting negotiator characteristics and policies

You may set the following negotiator characteristics and policies. For
information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

v Prioritize the queue maintained by the negotiator
  Each job step submitted to LoadLeveler is assigned a system priority number,
  based on the evaluation of the SYSPRIO keyword expression in the
  configuration file of the central manager (a sketch of typical expressions
  appears at the end of this topic). The LoadLeveler system priority number is
  assigned when the central manager adds the new job step to the queue of job
  steps eligible for dispatch. Once assigned, the system priority number for a
  job step is not changed, except under the following circumstances:
  - An administrator or user issues the llprio command to change the system
    priority of the job step.
  - The value set for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword is
    not zero.
  - An administrator uses the llmodify command with the -s option to alter
    the system priority of a job step.
  - A program with administrator credentials uses the ll_modify subroutine to
    alter the system priority of a job step.
  Job steps assigned higher SYSPRIO numbers are considered for dispatch before
  job steps with lower numbers. For related information, see the following
  topics:
  - "Controlling the central manager scheduling cycle" on page 73
  - "Setting and changing the priority of a job" on page 230
  - "llmodify - Change attributes of a submitted job step" on page 464
  - "ll_modify subroutine" on page 677
v Prioritize the order of executing machines maintained by the negotiator
  Each executing machine is assigned a machine priority number, based on the
  evaluation of the MACHPRIO keyword expression in the configuration file of
  the central manager. The LoadLeveler machine priority number is updated
  every time the central manager updates its machine data. Machines assigned
  higher MACHPRIO numbers are considered to run jobs before machines with
  lower numbers. For example, a machine with a MACHPRIO of 10 is considered to
  run a job before a machine with a MACHPRIO of 5. Similarly, a machine with a
  MACHPRIO of -2 would be considered to run a job before a machine with a
  MACHPRIO of -3.
  Note that the MACHPRIO keyword is valid only on the machine where the
  central manager is running. Using this keyword in a local configuration file
  has no effect.
  When you use a MACHPRIO expression that is based on load average, the
  machine may be temporarily ordered later in the list immediately after a job
  is scheduled to that machine. This temporary drop in priority happens
  because the negotiator adds a compensating factor to the startd machine's
  load average every time the negotiator assigns a job. For more information,
  see the NEGOTIATOR_LOADAVG_INCREMENT keyword.
v Specify additional negotiator policies
  This topic lists keywords that were not mentioned in the previous
  configuration steps. Unless your installation has special requirements for
  any of these keywords, you can use them with their default settings:
  - NEGOTIATOR_INTERVAL
  - NEGOTIATOR_CYCLE_DELAY
  - NEGOTIATOR_CYCLE_TIME_LIMIT
  - NEGOTIATOR_LOADAVG_INCREMENT
  - NEGOTIATOR_PARALLEL_DEFER
  - NEGOTIATOR_PARALLEL_HOLD
  - NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
  - NEGOTIATOR_REJECT_DEFER
  - NEGOTIATOR_REMOVE_COMPLETED
  - NEGOTIATOR_RESCAN_QUEUE
  - SCALE_ACROSS_SCHEDULING_TIMEOUT
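The following sketch shows typical SYSPRIO and MACHPRIO expressions: they
order the job queue by submission time and prefer lightly loaded machines.
Treat these lines as an illustrative assumption rather than required settings
(QDate and LoadAvg are LoadLeveler variables, and both keywords belong in the
central manager's configuration file):

   # FIFO ordering: earlier-submitted job steps get higher system priority
   SYSPRIO : 0 - (QDate)

   # Prefer machines with the lowest load average
   MACHPRIO : 0 - (LoadAvg)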
Specifying alternate central managers

In one of the machine stanzas specified in the administration file, you
specified that the machine would serve as the central manager. It is possible
for a problem, such as a network communication, software, or hardware failure,
to make this central manager unusable. In such cases, the other machines in
the LoadLeveler cluster believe that the central manager machine is no longer
operating. To remedy this situation, you can assign one or more alternate
central managers in the machine stanza to take control.

The following machine stanza example defines the machine deep_blue as an
alternate central manager:

   deep_blue: type=machine
   central_manager = alt

If the primary central manager fails, the alternate central manager then
becomes the central manager. The alternate central manager is chosen based
upon the order in which its respective machine stanza appears in the
administration file.

When an alternate becomes the central manager, jobs will not be lost, but it
may take a few minutes for all of the machines in the cluster to check in with
the new central manager. As a result, job status queries may be incorrect for
a short time.

When you define alternate central managers, you should set the following
keywords in the configuration file:
v CENTRAL_MANAGER_HEARTBEAT_INTERVAL
v CENTRAL_MANAGER_TIMEOUT

In the following example, the alternate central manager will wait for 30
intervals, where each interval is 45 seconds:

   # Set a 45 second interval
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
   # Set the number of intervals to wait
   CENTRAL_MANAGER_TIMEOUT = 30

For more information on central manager backup, refer to "What happens if the
central manager isn't operating?" on page 708. For information about
configuration file keyword syntax and other details, see Chapter 12,
"Configuration file reference," on page 263.

Defining network characteristics

A port number is an integer that specifies the port to use to connect to the
specified daemon. You can define these port numbers in the configuration file
or the /etc/services file, or you can accept the defaults. LoadLeveler first
looks in the configuration file for these port numbers. If LoadLeveler does
not find the value in the configuration file, it looks in the /etc/services
file. If the value is not found in this file, the default is used. See
Appendix C, "LoadLeveler port usage," on page 741 for more information.

Specifying file and directory locations

The configuration file provided with LoadLeveler specifies default locations
for all of the files and directories. You can modify their locations using the
keywords shown in Table 12. Keep in mind that the LoadLeveler installation
process installs files in these directories, and these files may be
periodically cleaned up. Therefore, you should not keep any files that do not
belong to LoadLeveler in these directories.

Managing distributed software systems is a primary concern for all system
administrators. Allowing users to share file systems to obtain a single,
network-wide image is one way to make managing LoadLeveler easier.

Table 12. Default locations for all of the files and directories

   To specify the
   location of the:    Specify this keyword:
   Administration      ADMIN_FILE
   file
   Local               LOCAL_CONFIG
   configuration
   file
   Local directory     The following subdirectories reside in the local
                       directory. It is possible that the local directory and
                       LoadLeveler's home directory are the same.
                       v COMM
                       v EXECUTE
                       v LOG
                       v SPOOL and HISTORY
                       Tip: To maximize performance, keep the log, spool, and
                       execute directories in a local file system. Also, to
                       measure the performance of your network, consider using
                       one of the available products, such as Toolbox/6000.
Table 12. Default locations for all of the files and directories (continued)

   To specify the
   location of the:    Specify this keyword:
   Release directory   RELEASEDIR
                       The following subdirectories are created during
                       installation and reside in the release directory. You
                       can change their locations.
                       v BIN
                       v LIB
   Core dump           You may specify alternate directories to hold core
   directory           dumps for the daemons and starter process:
                       v MASTER_COREDUMP_DIR
                       v NEGOTIATOR_COREDUMP_DIR
                       v SCHEDD_COREDUMP_DIR
                       v STARTD_COREDUMP_DIR
                       v GSMONITOR_COREDUMP_DIR
                       v KBDD_COREDUMP_DIR
                       v STARTER_COREDUMP_DIR

When specifying core dump directories, be sure that the access permissions are
set so the LoadLeveler daemon or process can write to the core dump directory.
The permissions set for path names specified in the keywords just mentioned
must allow writing by both root and the LoadLeveler ID. The permissions set
for the path name specified for the STARTER_COREDUMP_DIR keyword must allow
writing by root, the LoadLeveler ID, and any user who can submit LoadLeveler
jobs. The simplest way to be sure the access permissions are set correctly is
to set them the same as those of the /tmp directory.

If a problem with access permissions prevents a LoadLeveler daemon or process
from writing to a core dump directory, a message will be written to the log,
and the daemon or process will continue using the default /tmp directory for
core files.

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.
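For instance, to place starter core files somewhere other than /tmp, you might
add a line like the following to the configuration file and give the directory
/tmp-like permissions. The path is an assumption for this example:

   # In LoadL_config: alternate core dump directory for the starter process
   STARTER_COREDUMP_DIR = /var/loadl/cores

   # Shell commands to create the directory with the same mode as /tmp
   mkdir -p /var/loadl/cores
   chmod 1777 /var/loadl/cores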
  • 69. “Saving log files” on page 53 describes the configuration keyword to use to save logs for problem diagnosis. For information about configuration file keyword syntax and other details, see Chapter 12, “Configuration file reference,” on page 263. Table 13. Log control statements Daemon/ Log File (required) Max Length (required) Debug Control (required) Process (See note 1) (See note 2) (See note 4 on page 50) Master MASTER_LOG = MAX_MASTER_LOG = bytes [buffer MASTER_DEBUG = flags [buffer path bytes] flags] Schedd SCHEDD_LOG = MAX_SCHEDD_LOG = bytes [buffer SCHEDD_DEBUG = flags [buffer path bytes] flags] Startd STARTD_LOG = path MAX_STARTD_LOG = bytes [buffer STARTD_DEBUG = flags [buffer bytes] flags] Starter STARTER_LOG = MAX_STARTER_LOG = bytes [buffer STARTER_DEBUG = flags [buffer path bytes] flags] Negotiator NEGOTIATOR_LOG MAX_NEGOTIATOR_LOG = bytes NEGOTIATOR_DEBUG = flags = path [buffer bytes] [buffer flags] Kbdd KBDD_LOG = path MAX_KBDD_LOG = bytes [buffer KBDD_DEBUG = flags [buffer bytes] flags] GSmonitor GSMONITOR_LOG MAX_GSMONITOR_LOG = bytes GSMONITOR_DEBUG = flags = path [buffer bytes] [buffer flags] where: buffer bytes Is the size of the circular buffer. The default value is 0, which indicates that the buffer is disabled. To prevent the daemon from running out of memory, this value should not be too large. Brackets must be used to specify buffer bytes. buffer flags Indicates that messages with buffer flags in addition to messages with flags will be stored in the circular buffer in memory. The default value is blank, which indicates that the logging buffer is disabled because no additional debug flags were specified for buffering. Brackets must be used to specify buffer flags. Note: 1. When coding the path for the log files, it is not necessary that all LoadLeveler daemons keep their log files in the same directory, however, you will probably find it a convenient arrangement. 2. There is a maximum length, in bytes, beyond which the various log files cannot grow. Each file is allowed to grow to the specified length and is then saved to an .old file. The .old files are overwritten each time the log is saved, thus the maximum space devoted to logging for any one program will be twice the maximum length of its log file. The default length is 64 KB. To obtain records over a longer period of time, that do not get overwritten, you can use the SAVELOGS keyword in the local or global configuration files. See “Saving log files” on page 53 for more information on extended capturing of LoadLeveler logs. Chapter 4. Configuring the LoadLeveler environment 49
   You can also specify that the log file be started anew with every
   invocation of the daemon by setting the appropriate TRUNC statement to
   true:
   v TRUNC_MASTER_LOG_ON_OPEN = true|false
   v TRUNC_STARTD_LOG_ON_OPEN = true|false
   v TRUNC_SCHEDD_LOG_ON_OPEN = true|false
   v TRUNC_KBDD_LOG_ON_OPEN = true|false
   v TRUNC_STARTER_LOG_ON_OPEN = true|false
   v TRUNC_NEGOTIATOR_LOG_ON_OPEN = true|false
   v TRUNC_GSMONITOR_LOG_ON_OPEN = true|false
3. LoadLeveler creates temporary log files used by the starter daemon. These
   files are used for synchronization purposes. When a job starts, a
   StarterLog.pid file is created. When the job ends, this file is appended to
   the StarterLog file.
4. Normally, only those who are installing or debugging LoadLeveler will need
   to use the debug flags, described in "Controlling debugging output" on page
   51. The default error logging, obtained by leaving the right side of the
   debug control statement null, will be sufficient for most installations.

Controlling the logging buffer

LoadLeveler allows a LoadLeveler daemon to store log messages in a buffer in
memory instead of writing the messages to a log file. The administrator can
force the messages in this buffer to be written to the log file, when
necessary, to diagnose a problem. The buffer is circular and, once it is full,
older messages are discarded as new messages are logged.

The llctl dumplogs command is used to write the contents of the logging buffer
to the appropriate log file for the Master, Negotiator, Schedd, and Startd
daemons.

Buffering will be disabled if either the buffer length is 0 or no additional
debug flags are specified for buffering.

See "Configuring recording activity and log files" on page 48 for log control
statement specifications. See TWS LoadLeveler: Diagnosis and Messages Guide
for additional information on TWS LoadLeveler log files.

Logging buffer example

With the following configuration, the Schedd daemon will write only D_ALWAYS
and D_SCHEDD messages to the ${LOG}/SchedLog log file. The following messages
will be stored in the buffer:
v D_ALWAYS
v D_SCHEDD
v D_LOCKING

The maximum size of the Schedd log is 64 MB and the size of the logging buffer
is 32 MB:

   SCHEDD_LOG = ${LOG}/SchedLog
   MAX_SCHEDD_LOG = 64000000 [32000000]
   SCHEDD_DEBUG = D_SCHEDD [D_LOCKING]

To write the contents of the logging buffer to the SchedLog file on the local
machine, issue:

   llctl dumplogs
To write the contents of the logging buffer to the SchedLog file on node1 in
the LoadLeveler cluster, issue:

   llctl -h node1 dumplogs

To write the contents of the logging buffers to the SchedLog files on all
machines, issue:

   llctl -g dumplogs

Note that the messages written from the logging buffer include bracketing
messages and a prefix to identify them easily:

   =======================BUFFER BEGIN========================
   BUFFER: message .....
   BUFFER: message .....
   =======================BUFFER END==========================

Controlling debugging output

You can control the level of debugging output logged by LoadLeveler programs.
The following flags are presented here for your information, though they are
used primarily by IBM personnel for debugging purposes:

D_ACCOUNT
   Logs accounting information about processes. If used, it may slow down the
   network.

D_ACCOUNT_DETAIL
   Logs detailed accounting information about processes. If used, it may slow
   down the network and increase the size of log files.

D_ADAPTER
   Logs messages related to adapters.

D_AFS
   Logs information related to AFS credentials.

D_CKPT
   Logs information related to checkpoint and restart.

D_DAEMON
   Logs information regarding basic daemon setup and operation, including
   information on the communication between daemons.

D_DBX
   Bypasses certain signal settings to permit debugging of the processes as
   they execute in certain critical regions.

D_EXPR
   Logs steps in parsing and evaluating control expressions.

D_FAIRSHARE
   Displays messages related to fair share scheduling in the daemon logs. In
   the global configuration file, D_FAIRSHARE can be added to SCHEDD_DEBUG and
   NEGOTIATOR_DEBUG.

D_FULLDEBUG
   Logs details about most actions performed by each daemon, but doesn't log
   as much activity as setting all the flags.

D_HIERARCHICAL
   Enables messages relating to problems with the transmission of hierarchical
   messages. A hierarchical message is sent from an originating node to lower
   ranked receiving nodes.

D_JOB
   Logs job requirements and preferences when making decisions regarding
   whether a particular job should run on a particular machine.
D_KERNEL
   Activates diagnostics for errors involving the process tracking kernel
   extension.

D_LOAD
   Displays the load average on the startd machine.

D_LOCKING
   Logs requests to acquire and release locks.

D_LXCPUAFNT
   Logs messages related to Linux CPU affinity. This flag is only valid for
   the startd daemon.

D_MACHINE
   Logs machine control functions and variables when making decisions
   regarding starting, suspending, resuming, and aborting remote jobs.

D_MUSTER
   Logs information related to multicluster processing.

D_NEGOTIATE
   Displays the process of looking for a job to run in the negotiator. It only
   pertains to this daemon.

D_PCRED
   Directs that extra debug information should be written to a file if the
   setpcred() function call fails.

D_PROC
   Logs information about jobs being started remotely, such as the number of
   bytes fetched and stored for each job.

D_QUEUE
   Logs changes to the job queue.

D_REFCOUNT
   Logs activity associated with reference counting of internal LoadLeveler
   objects.

D_RESERVATION
   Logs reservation information in the negotiator and Schedd daemon logs.
   D_RESERVATION can be added to SCHEDD_DEBUG and NEGOTIATOR_DEBUG.

D_RESOURCE
   Logs messages about the management and consumption of resources. These
   messages are recorded in the negotiator log.

D_SCHEDD
   Displays how the Schedd works internally.

D_SDO
   Displays messages detailing LoadLeveler objects being transmitted between
   daemons and commands.

D_SECURITY
   Logs information related to Cluster Security (CtSec) services identities.

D_SPOOL
   Logs information related to the usage of databases in the LoadLeveler spool
   directory.

D_STANZAS
   Displays internal information about the parsing of the administration file.

D_STARTD
   Displays how the startd works internally.

D_STARTER
   Displays how the starter works internally.

D_STREAM
   Displays messages detailing socket I/O.
D_SWITCH
   Logs entries related to switch activity and LoadLeveler Switch Table Object
   data.

D_THREAD
   Displays the ID of the thread producing the log message. The thread ID is
   displayed immediately following the date and time. This flag is useful for
   debugging threaded daemons.

D_XDR
   Logs information regarding External Data Representation (XDR) communication
   protocols.

For example:

   SCHEDD_DEBUG = D_CKPT D_XDR

causes the Schedd daemon to log information about checkpointing user jobs and
to exchange XDR messages with other LoadLeveler daemons. These flags will
primarily be of interest to LoadLeveler implementers and debuggers.

The LL_COMMAND_DEBUG environment variable can be set to a string of debug
flags in the same way as the *_DEBUG configuration keywords are set. Normally,
LoadLeveler commands and APIs do not print debug messages, but with this
environment variable set, the requested classes of debugging messages will be
logged to stderr. For example:

   LL_COMMAND_DEBUG="D_ALWAYS D_STREAM" llstatus

causes the llstatus command to print debug messages related to I/O to stderr.

Saving log files

By default, LoadLeveler stores only the two most recent iterations of a
daemon's log file (<daemon name>Log and <daemon name>Log.old). Occasionally,
for problem diagnosis, users will need to capture LoadLeveler logs over an
extended period. Users can specify that all log files be saved to a particular
directory by using the SAVELOGS keyword in a local or global configuration
file. Be aware that LoadLeveler does not provide any way to manage and clean
out all of those log files, so users must be sure to specify a directory in a
file system with enough space to accommodate them. This file system should be
separate from the one used for the LoadLeveler log, spool, and execute
directories.

Each log file is represented by the name of the daemon that generated it, the
exact time the file was generated, and the name of the machine on which the
daemon is running. When you list the contents of the SAVELOGS directory, the
list of log file names looks like this:

   NegotiatorLogNov02.16:10:39.123456.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:42.987654.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:46.564123.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:48.234345.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:51.123456.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:53.566987.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:09:19.622387.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:09:51.499823.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:10:30.876546.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:05.543677.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:26.688901.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:47.443556.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:10:12.712680.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:10:37.342156.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:05.697753.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:26.881234.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:47.231234.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:10:12.125556.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:10:37.961486.c163n10.ppd.pok.ibm.com

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

Setting up file system monitoring

You can use the file system keywords to monitor the file system space or
inodes used by LoadLeveler for:
v Logs
v Saving executables
v Spool information
v History files

You can also use the file system keywords to take preventive action and avoid
problems caused by running out of file system space or inodes. You do this by
setting the frequency with which LoadLeveler checks the file system free space
or inodes, and by setting the upper and lower thresholds that initiate system
responses to the free space or inodes available. By setting a realistic span
between the lower and upper thresholds, you will avoid excessive system
actions.

The file system monitoring keywords are:
v FS_INTERVAL
v FS_NOTIFY
v FS_SUSPEND
v FS_TERMINATE
v INODE_NOTIFY
v INODE_SUSPEND
v INODE_TERMINATE

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

Defining LoadLeveler machine characteristics

You can use the following keywords to define the characteristics of machines
in the LoadLeveler cluster. For information about configuration file keyword
syntax and other details, see Chapter 12, "Configuration file reference," on
page 263.
v ARCH
v CLASS
v CUSTOM_METRIC
v CUSTOM_METRIC_COMMAND
v FEATURE
v GSMONITOR_RUNS_HERE
v MAX_STARTERS
v SCHEDD_RUNS_HERE
v SCHEDD_SUBMIT_AFFINITY
v STARTD_RUNS_HERE
v START_DAEMONS
v VM_IMAGE_ALGORITHM
v X_RUNS_HERE

Defining job classes that a LoadLeveler machine will accept

There are a number of possible ways of defining job classes. The following
examples illustrate some of them.

v Example 1
  This example specifies multiple classes:

     Class = No_Class(2)

  or

     Class = { "No_Class" "No_Class" }

  The machine will only run jobs that have either defaulted to or explicitly
  requested class No_Class. A maximum of two LoadLeveler jobs are permitted to
  run simultaneously on the machine if the MAX_STARTERS keyword is not
  specified. See "Specifying how many jobs a machine can run" for more
  information on MAX_STARTERS.

v Example 2
  This example specifies multiple classes:

     Class = No_Class(1) Small(1) Medium(1) Large(1)

  or

     Class = { "No_Class" "Small" "Medium" "Large" }

  The machine will only run a maximum of four LoadLeveler jobs that have
  either defaulted to, or explicitly requested, the No_Class, Small, Medium,
  or Large class. A LoadLeveler job with class IO_bound, for example, would
  not be eligible to run here.

v Example 3
  This example specifies multiple classes:

     Class = B(2) D(1)

  or

     Class = { "B" "B" "D" }

  The machine will run only LoadLeveler jobs that have explicitly requested
  class B or D. Up to three LoadLeveler jobs may run simultaneously: two of
  class B and one of class D. A LoadLeveler job with class No_Class, for
  example, would not be eligible to run here.

Specifying how many jobs a machine can run

To specify how many jobs a machine can run, you need to take into
consideration both the MAX_STARTERS keyword and the Class statement. This is
described in more detail in "Defining LoadLeveler machine characteristics" on
page 54.

For example, if the configuration file contains these statements:
   Class = A(1) B(2) C(1)
   MAX_STARTERS = 2

then the machine can run a maximum of two LoadLeveler jobs simultaneously. The
possible combinations of LoadLeveler jobs are:
v A and B
v A and C
v B and B
v B and C
v Only A, or only B, or only C

If this keyword is specified together with a Class statement, the maximum
number of jobs that can be run is equal to the lower of the two numbers. For
example, if:

   MAX_STARTERS = 2
   Class = class_a(1)

then the maximum number of job steps that can be run is one (the Class
statement defines one class).

If you specify the MAX_STARTERS keyword without specifying a Class statement,
by default one class still exists (called No_Class). Therefore, the maximum
number of jobs that can be run when you do not specify a Class statement is
one.

Note: If the MAX_STARTERS keyword is not defined in either the global
configuration file or the local configuration file, the maximum number of jobs
that the machine can run is equal to the number of classes in the Class
statement.

Defining security mechanisms

LoadLeveler can be configured to control the authentication and authorization
of LoadLeveler functions by using Cluster Security (CtSec) services, a
subcomponent of Reliable Scalable Cluster Technology (RSCT), which uses
host-based authentication (HBA) as an underlying security mechanism.
LoadLeveler permits only one security service to be configured at a time.

You can skip this topic if you do not plan to use this security feature or if
you plan to use the credential forwarding provided by the llgetdce and
llsetdce program pair. Refer to "Using the alternative program pair: llgetdce
and llsetdce" on page 75 for more information.

LoadLeveler for Linux does not support CtSec security.

LoadLeveler can also be enabled to interact with OpenSSL for secure
multicluster communications.

Table 14 on page 57 lists the topics that explain LoadLeveler daemons and how
you may define their characteristics and modify their behavior.
Table 14. Roadmap of configuration tasks for securing LoadLeveler operations

   To learn about:                  Read the following:
   Securing LoadLeveler             v "Configuring LoadLeveler to use cluster
   operations using cluster           security services"
   security services                v "Steps for enabling CtSec services" on
                                      page 58
                                    v "Limiting which security mechanisms
                                      LoadLeveler can use" on page 60
   Enabling LoadLeveler to secure   "Steps for securing communications within a
   multicluster communication       LoadLeveler multicluster" on page 153
   with OpenSSL
   Correctly specifying             Chapter 12, "Configuration file reference,"
   configuration file keywords      on page 263

Configuring LoadLeveler to use cluster security services

Cluster security (CtSec) services allows a software component to authenticate
and authorize the identity of one of its peers or clients.

When configured to use CtSec services, LoadLeveler will:
v Authenticate the identity of users and programs interacting with
  LoadLeveler.
v Authorize users and programs to use LoadLeveler services. It prevents
  unauthorized users and programs from misusing resources or disrupting
  services.

To use CtSec services, all nodes running LoadLeveler must first be configured
as part of a cluster running Reliable Scalable Cluster Technology (RSCT). For
details on CtSec services administration, see IBM Reliable Scalable Cluster
Technology: Administration Guide, SA22-7889.

CtSec services are designed to use multiple security mechanisms, and each
security mechanism must be configured for LoadLeveler. At the present time,
directions are provided only for configuring the host-based authentication
(HBA) security mechanism for LoadLeveler's use. If CtSec is configured to use
additional security mechanisms that are not configured for LoadLeveler's use,
then the LoadLeveler configuration file keyword SEC_IMPOSED_MECHS must be
specified. This keyword is used to limit the security mechanisms that will be
used by CtSec services to only those that are configured for use by
LoadLeveler.

Authorization is based on user identity. When CtSec services are enabled for
LoadLeveler, user identity will differ depending on the authentication
mechanism in use. A user's identity in UNIX host-based authentication is the
user's network identity, which is comprised of the user name and host name,
such as user_name@host.

LoadLeveler uses CtSec services to authorize owners of jobs, administrators,
and LoadLeveler daemons to perform certain actions. CtSec services uses its
own identity mapping file to map the clients' network identity to a local
identity when performing authorizations. A typical local identity is the user
name without a host name. The local identities of the LoadLeveler
administrators must be added as members of the group specified by the keyword
SEC_ADMIN_GROUP. The local identity of the LoadLeveler user name must be added
as the sole member of the group specified by the keyword SEC_SERVICES_GROUP.
The LoadLeveler services and administrative groups, those identified by the
keywords
SEC_SERVICES_GROUP and SEC_ADMIN_GROUP, must be the same across all nodes in
the LoadLeveler cluster.

To ensure consistency in performing tasks which require owner, administrative,
or daemon privileges across all nodes in the LoadLeveler cluster, user network
identities must be mapped identically across all nodes in the LoadLeveler
cluster. If this is not the case, LoadLeveler authorizations may fail.

Steps for enabling CtSec services

To enable LoadLeveler to use CtSec services, perform the following steps:
1. Include, in the Trusted Host List, the host names of all hosts with which
   communications may take place. If LoadLeveler tries to communicate with a
   host not on the Trusted Host List, the message:

      The host identified in the credentials is not a trusted host on this system

   will occur. Additionally, the system administrator should ensure that
   public keys are manually exchanged between all hosts in the LoadLeveler
   cluster. Refer to IBM Reliable Scalable Cluster Technology: Administration
   Guide, SA22-7889 for information on setting up Trusted Host Lists and
   manually transferring public keys.
2. Create user IDs. Each LoadLeveler administrator ID and the LoadLeveler user
   ID need to be created, if they don't already exist. You can do this through
   SMIT or the mkuser command.
3. Ensure that the unix.map file contains the correct value for the service
   name ctloadl, which specifies the LoadLeveler user name. If you have
   configured LoadLeveler to use loadl as the LoadLeveler user name, either by
   default or by specifying loadl in the LoadLUserid keyword of the
   /etc/LoadL.cfg file, nothing needs to be done; the default map file will
   already contain the ctloadl service name assigned to loadl. If you have
   configured a different user name in the LoadLUserid keyword of the
   /etc/LoadL.cfg file, you will need to make sure that the
   /var/ct/cfg/unix.map file exists and that it assigns the same user name to
   the ctloadl service name. If the /var/ct/cfg/unix.map file does not exist,
   create one by copying the default map file /usr/sbin/rsct/cfg/unix.map. Do
   not modify the default map file. If the value of the LoadLUserid and the
   value associated with ctloadl are not the same, a security services error
   indicating a UNIX identity mismatch will occur.
4. Add entries to the global mapping file of each machine in the LoadLeveler
   cluster to map network identities to local identities. This file is located
   at /var/ct/cfg/ctsec_map.global. If this file doesn't yet exist, you should
   copy the default global mapping file to this location; don't modify the
   default mapping file. The default global mapping file, which is shared
   among all CtSec services exploiters, is located at
   /usr/sbin/rsct/cfg/ctsec_map.global. See IBM Reliable Scalable Cluster
   Technology for AIX: Technical Reference, SA22-7890 for more information on
   the mapping file.
   When adding names to the global mapping file, enter more specific entries
   ahead of the other, less specific entries. Remember that you must update
   the global mapping file on each machine in the LoadLeveler cluster, and
   each mapping file has to be updated with the security services identity of
   each member of the administrator group, the services group, and the users.
   Therefore, you would have entries like this:

      unix:brad@mach1.pok.ibm.com=bradleyf
      unix:brad@mach2.pok.ibm.com=bradleyf
      unix:brad@mach3.pok.ibm.com=bradleyf
      unix:marsha@mach2.pok.ibm.com=marshab
   unix:marsha@mach3.pok.ibm.com=marshab
   unix:loadl@mach1.pok.ibm.com=loadl
   unix:loadl@mach2.pok.ibm.com=loadl
   unix:loadl@mach3.pok.ibm.com=loadl

   However, if you are sure your LoadLeveler cluster is secure, you could specify the mapping for all machines this way:

   unix:brad@*=bradleyf
   unix:marsha@*=marshab
   unix:loadl@*=loadl

   This indicates that the UNIX network identities of the users brad, marsha, and loadl map to their respective security services identities on every machine in the cluster. Refer to IBM Reliable Scalable Cluster Technology for AIX: Technical Reference, SA22-7890 for a description of the syntax used in the ctsec_map.global file.
5. Create UNIX groups. The LoadLeveler administrator group and services group must be created on every machine in the cluster and must contain the same entries, the local identities of their members. This can be done either by using SMIT or the mkgroup command. For example, to create the group lladmin, which lists the LoadLeveler administrators:

   mkgroup "users=sam,betty,loadl" lladmin

   To create the group llsvcs, which lists the identity under which LoadLeveler daemons run, using the default ID of loadl:

   mkgroup users=loadl llsvcs

6. Add or update these keywords in the LoadLeveler configuration file:

   SEC_ENABLEMENT=CTSEC
   SEC_ADMIN_GROUP=name of lladmin group
   SEC_SERVICES_GROUP=group name that contains identities of LoadLeveler daemons

   The SEC_ENABLEMENT=CTSEC keyword indicates that the CtSec services mechanism should be used. SEC_ADMIN_GROUP points to the name of the UNIX group that contains the local identities of the LoadLeveler administrators. The SEC_SERVICES_GROUP keyword points to the name of the UNIX group that contains the local identity of the LoadLeveler daemons; all LoadLeveler daemons run as the LoadLeveler user ID. Refer to step 5 for a discussion of the administrator and services groups.
7. Update the .rhosts file in each user's home directory. This file is used to identify which UNIX identities can run LoadLeveler jobs on the local host machine. If the file does not exist in a user's home directory, you must create it. The .rhosts file must contain entries that specify all host and user combinations allowed to submit jobs that will run as the local user, either explicitly or through the use of wildcards. Entries in the .rhosts file are specified this way:

   HostNameField [UserNameField]

   Refer to IBM AIX Files Reference, SC23-4168 for further details about the .rhosts file format.
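   For example, a .rhosts file in the home directory of a local user might contain entries like the following (the host and user names are illustrative only):

   mach1.pok.ibm.com brad
   mach2.pok.ibm.com brad
   mach3.pok.ibm.com

   The first two entries allow user brad on mach1 and mach2 to run jobs as this local user; the third entry allows only the like-named user from mach3 to do so.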
Tips for configuring LoadLeveler to use CtSec services:

When using CtSec services for LoadLeveler, each machine in the LoadLeveler cluster must be set up properly. CtSec authenticates network identities based on trust established between individual machines in a cluster, which in turn is based on local host configurations. Because of this, it is possible for most of the cluster to run correctly while transactions from certain machines experience authentication or authorization problems. If unexpected authentication or authorization problems occur in a LoadLeveler cluster with CtSec enabled, check that the steps in "Steps for enabling CtSec services" on page 58 were correctly followed for each machine in the LoadLeveler cluster.

If any machine in a LoadLeveler cluster is improperly configured to run CtSec, you may see that:
v Users cannot perform user tasks (such as cancel) for jobs they submitted. Either the machine the job was submitted from or the machine the user operation was submitted from (or both) does not contain mapping files for the user that specify the same security services identity. The user should attempt the operation from the same machine the job was submitted from and record the results. If the user still cannot perform a user task on a job they submitted, they should contact the LoadLeveler administrator, who should review the steps in "Steps for enabling CtSec services" on page 58.
v LoadLeveler daemons fail to communicate. When LoadLeveler daemons communicate, they must first authenticate each other. If the daemons cannot authenticate, a message is put in the daemon log indicating an authentication failure. Ensure that the Trusted Host List on all LoadLeveler nodes contains the correct entries for all of the nodes in the LoadLeveler cluster. Also, make sure that the LoadLeveler Services group on all nodes of the LoadLeveler cluster contains the local identity for the LoadLeveler user name. The ctsec_map.global file must contain mapping rules to map the LoadLeveler user name from every machine in the LoadLeveler cluster to the local identity for the LoadLeveler user name.
  An example of what may happen when daemons fail to communicate is that an alternate central manager may take over while the primary central manager is still active. This can occur when the alternate central manager does not trust the primary central manager.

Limiting which security mechanisms LoadLeveler can use

As more security mechanisms become available, they must be configured for LoadLeveler's use. If there are security mechanisms configured for CtSec that are not configured for LoadLeveler's use, the LoadLeveler configuration file keyword SEC_IMPOSED_MECHS must specify the mechanisms that are configured for LoadLeveler.
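For example, if only the host-based authentication mechanism has been configured for LoadLeveler, the configuration file could restrict CtSec to that mechanism. The mechanism name shown here assumes the HBA mechanism is registered as unix; check your CtSec configuration and Chapter 12, "Configuration file reference," on page 263 for the exact names:

SEC_IMPOSED_MECHS = unix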
Defining usage policies for consumable resources

The LoadLeveler scheduler can schedule jobs based on the availability of consumable resources. You can use the following keywords to configure consumable resources:
v ENFORCE_RESOURCE_MEMORY
v ENFORCE_RESOURCE_POLICY
v ENFORCE_RESOURCE_SUBMISSION
v ENFORCE_RESOURCE_USAGE
v FLOATING_RESOURCES
v RESOURCES
v SCHEDULE_BY_RESOURCES

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Enabling support for bulk data transfer and rCxt blocks

On systems with device drivers and network adapters that support remote direct-memory access (RDMA), LoadLeveler allows bulk data transfer for jobs that use either the Internet or user space communication protocol mode. For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or control the use of bulk transfer. For user space jobs that request bulk transfer, however, LoadLeveler creates a consumable RDMA resource, and limits RDMA resources to four for a single machine with Switch Network Interface for HPS network adapters. There is no limit on RDMA resources for machines with InfiniBand network adapters.

You do not need to perform specific configuration or job-definition tasks to enable bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect whether IP communication uses bulk transfer; the implementation of IP where the job runs determines whether bulk transfer is supported.

To enable user space jobs to use bulk data transfer, you must update the LoadLeveler configuration file to include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines with Switch Network Interface for HPS network adapters. Example:

SCHEDULE_BY_RESOURCES = RDMA others

For additional information about using bulk data transfer and job-definition requirements, see "Using bulk data transfer" on page 188.

Gathering job accounting data

Your organization may have a policy of charging users or groups of users for the amount of resources that their jobs consume. You can do this using LoadLeveler's accounting feature. Using this feature, you can produce accounting reports that contain job resource information for completed serial and parallel job steps. You can also view job resource information on jobs that are still running.

The accounting record for a job step contains a separate set of resource usage data for each time the job step is dispatched to run. For example, the accounting record for a job step that is vacated and then started again will contain two sets of resource usage data. The first set covers the time period from when the job step was initially dispatched until it was vacated; the second set covers the time period from when the job step was dispatched after the vacate until the job step completed.
The job step's accounting data that is provided in the llsummary short listing and in the user mail contains only one set of resource usage data, from the last time the job step was dispatched to run. For example, the mail message for completion of a job step that is checkpointed with the hold (-h) option and then restarted will contain only the set of resource usage data for the dispatch that restarted the job from the checkpoint. To obtain the resource usage data for the entire job step, use the detailed llsummary command or the accounting API.

The following keywords allow you to control accounting functions:
v ACCT
v ACCT_VALIDATION
v GLOBAL_HISTORY
v HISTORY_PERMISSION
v JOB_ACCT_Q_POLICY
v JOB_LIMIT_POLICY

For example, the following section of the configuration file specifies that the accounting function is turned on. It also identifies the default module used to perform account validation and the directory containing the global history files:

ACCT = A_ON A_VALIDATE
ACCT_VALIDATION = $(BIN)/llacctval
GLOBAL_HISTORY = $(SPOOL)

Table 15 lists the topics related to configuring, gathering, and using job accounting data.

Table 15. Roadmap of tasks for gathering job accounting data

To learn about:  Configuring LoadLeveler to gather job accounting data
Read:            v "Collecting job resource data on serial and parallel jobs"
                 v "Collecting job resource data based on machines" on page 64
                 v "Collecting job resource data based on events" on page 64
                 v "Collecting job resource information based on user accounts" on page 65
                 v "Collecting accounting data for reservations" on page 63
                 v "Collecting the accounting information and storing it into files" on page 66
                 v "64-bit support for accounting functions" on page 67
                 v "Example: Setting up job accounting files" on page 67

To learn about:  Managing accounting data
Read:            v "Producing accounting reports" on page 66
                 v "Correlating AIX and LoadLeveler accounting records" on page 66
                 v "llacctmrg - Collect machine history files" on page 413
                 v "llsummary - Return job resource information for accounting" on page 535

To learn about:  Correctly specifying configuration file keywords
Read:            Chapter 12, "Configuration file reference," on page 263

Collecting job resource data on serial and parallel jobs

Information on completed serial and parallel job steps is gathered using the UNIX wait3 system call.
Information on serial and parallel jobs that have not completed is gathered in a platform-dependent manner by examining data from the UNIX process.

Accounting information on a completed serial job step is determined by accumulating the resources consumed by that job on the machines that ran it. Similarly, accounting information on completed parallel job steps is gathered by accumulating the resources used on all of the nodes that ran the job step.

You can also view resource consumption information on serial and parallel jobs that are still running by specifying the -x option of the llq command. To enable llq -x, specify the following keywords in the configuration file:
v ACCT = A_ON A_DETAIL
v JOB_ACCT_Q_POLICY = number
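A minimal sketch of this configuration follows; the value 300 for JOB_ACCT_Q_POLICY is only an illustrative number, not a recommended setting:

ACCT = A_ON A_DETAIL
JOB_ACCT_Q_POLICY = 300

With these keywords in effect, a user can query resource consumption for jobs that are still running:

$ llq -x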
Collecting accounting information for recurring jobs

For recurring jobs, accounting records are written as each occurrence of each step of the job completes. The reservation ID field in the accounting record can be used to distinguish one occurrence from another.

Collecting accounting data for reservations

LoadLeveler can collect accounting data for reservations, which are set periods of time during which node resources are reserved for the use of particular users or groups.

To enable recording of reservation information, specify the following keywords in the configuration file:
v To turn on accounting for reservations, add the A_RES flag to the ACCT keyword.
v To specify a file other than the default history file to contain the data, use the RESERVATION_HISTORY keyword.

See Chapter 12, "Configuration file reference," on page 263 for details about the ACCT and RESERVATION_HISTORY keywords.

When these keyword values are set and a reservation ends or is canceled, LoadLeveler records the following information:
v The reservation ID
v The time at which the reservation was created
v The user ID of the reservation owner
v The name of the owning group
v Requested and actual start times
v Requested and actual duration
v The actual time at which the reservation ended or was canceled
v Whether the reservation was created with the SHARED or REMOVE_ON_IDLE options
v A list of users and a list of groups that were authorized to use the reservation
v The number of reserved nodes
v The names of reserved nodes

This reservation information is appended in a single line to the reservation history file for the reservation. The format of reservation history data is:

Reservation ID!Reservation Creation Time!Owner!Owning Group!Start Time!
Actual Start Time!Duration!Actual Duration!Actual End Time!SHARED(yes|no)!
REMOVE_ON_IDLE(yes|no)!Users!Groups!Number of Nodes!Nodes!BG C-nodes!
BG Connection!BG Shape!Number of BG BPs!BG BPs

In reservation history data:
v The unit of measure for start times and end times is the number of seconds since January 1, 1970.
v The unit of time for durations is seconds.

Note: As each occurrence of a recurring reservation completes, an accounting record is appended to the reservation history file. The format of the record is identical to that of a one-time reservation. In the record, the Reservation ID includes the occurrence ID of the completed reservation.

When you cancel the entire recurring reservation (as opposed to canceling only one occurrence), one additional accounting record is written. This record is based on the state of the reservation:
v If an occurrence is ACTIVE, the end time and duration of that occurrence are set and an accounting record is written.
v If there are no ACTIVE occurrences, an accounting record is written for the next scheduled occurrence. This is similar to the accounting record that is written when you cancel a one-time reservation in the WAITING state.

The following is an example of a reservation history file entry:

bgldd1.rchland.ibm.com.68.r!1150242970!ezhong!group1!1150243200!1150243200!
300!300!1150243500!no!no!yang!fvt,dev!1!bgldd1!0!!!0!
bgldd1.rchland.ibm.com.54.r!1150143472!ezhong!No_Group!1153612800!0!60!0!
1150243839!no!no!!!0!32!MESH!0x0x0!1!R010(J115)
bgldd1.rchland.ibm.com.70.r!1150244654!ezhong!No_Group!1150244760!1150244760!
60!60!1150244820!yes!yes!user1,user2!group1,group2!0!512!MESH!1x1x1!1!R010

To collect the reservation information stored in the history file, use the llacctmrg command with the -R option. For llacctmrg command syntax, see "llacctmrg - Collect machine history files" on page 413.

To format reservation history data contained in a file, use the sample script llreshist.pl in directory ${RELEASEDIR}/samples/llres/.

Collecting job resource data based on machines

LoadLeveler can collect job resource usage information for every machine on which a job may run. A job may run on more than one machine because it is a parallel job or because the job is vacated from one machine and rescheduled to another machine.

To enable recording of resources by machine, you need to specify ACCT = A_ON A_DETAIL in the configuration file.

The machine's speed is part of the data collected. With this information, an installation can develop a chargeback program that charges more or less for resources consumed by a job on different machines. For more information on a machine's speed, refer to the machine stanza information; see "Defining machines" on page 84.

Collecting job resource data based on events

In addition to collecting job resource information based upon the machines used, you can gather this information based upon an event or time that you specify.
For example, you may want to collect accounting information at the end of every work shift or at the end of every week or month. To collect accounting information on all machines in this manner, use the llctl command with the capture parameter:

llctl -g capture eventname

eventname is any continuous string of characters (no white space is allowed) that defines the event about which you are collecting accounting data. For example, if you were collecting accounting data on the graveyard work shift, your command could be:

llctl -g capture graveyard

This command allows you to obtain a snapshot of the resources consumed by active jobs up to and including the moment when you issued the command. If you want to capture this type of information on a regular basis, you can set up a crontab entry to invoke this command regularly. For example:

# sample crontab for accounting
# shift crontab 94/8/5
#
# Set up three shifts, first, second, and graveyard shift.
# Crontab entries indicate the end of shift.
#
#M  H  d m day command
#
00 08  * * *  /u/loadl/bin/llctl -g capture graveyard
00 16  * * *  /u/loadl/bin/llctl -g capture first
00 00  * * *  /u/loadl/bin/llctl -g capture second

For more information on the llctl command, refer to "llctl - Control LoadLeveler daemons" on page 439. For more information on the collection of accounting records, see "llq - Query job status" on page 479.

Collecting job resource information based on user accounts

If your installation is interested in keeping track of resources used on an account basis, you can require all users to specify an account number in their job command files. They can specify this account number with the account_no keyword, which is explained in detail in "Job command file keyword descriptions" on page 359. Interactive POE jobs can specify an account number using the LOADL_ACCOUNT_NO environment variable.

LoadLeveler validates this account number by comparing it against the list of account numbers specified for the user in the user stanza in the administration file.

Account validation is under the control of the ACCT keyword in the configuration file. The routine that performs the validation is called llacctval. You can supply your own validation routine by specifying the ACCT_VALIDATION keyword in the configuration file. The following are passed as character string arguments to the validation routine:
v User name
v User's login group name
v Account number specified on the job
v Blank-separated list of account numbers obtained from the user's stanza in the administration file

The account validation routine must exit with a return code of zero if the validation succeeds. If it fails, the return code must be a nonzero number.
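The following is a minimal sketch of a replacement validation routine written in Korn shell; the script itself and its installation path are hypothetical, and it implements only the argument and return code contract described above:

#!/bin/ksh
# Sketch: succeed only if the account number given on the job ($3)
# appears in the blank-separated list from the user stanza ($4).
# Arguments: $1 user name, $2 login group name, $3 account number,
#            $4 blank-separated list of valid account numbers.
for acct in $4
do
  if [ "$acct" = "$3" ]
  then
    exit 0        # validation succeeds
  fi
done
exit 1            # nonzero return code: validation fails

To use such a routine, point the ACCT_VALIDATION keyword at it, for example:

ACCT_VALIDATION = /u/loadl/bin/myacctval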
Collecting the accounting information and storing it into files

LoadLeveler stores the accounting information that it collects in a file called history in the spool directory of the machine that initially scheduled the job, the Schedd machine. Data on parallel jobs is also stored in the history files.

Resource information collected on the LoadLeveler job is constrained by the capabilities of the wait3 system call. Information for processes that fork child processes will include data for those child processes as long as the parent process waits for the child process to terminate. Complete data may not be collected for jobs that are not composed of simple parent/child processes. For example, if you have a LoadLeveler job that invokes an rsh command to execute a function on another machine, the resources consumed on the other machine will not be collected as part of the LoadLeveler accounting data.

LoadLeveler accounting uses the following types of files:
v The local history file, which is local to each Schedd machine, is where job resource information is first recorded. These files are usually named history and are located in the spool directory of each Schedd machine, but you may specify an alternate name with the HISTORY keyword in either the global or local configuration file.
v The global history file is a combination of the history files from some or all of the machines in the LoadLeveler cluster merged together. The llacctmrg command is used to collect the files together into a global file. As the files are collected from each machine, the local history file for that machine is reset to contain no data. The file is named globalhist.YYYYMMDDHHmm. You may specify the directory in which to place the file when you invoke the llacctmrg command, or you can specify the directory with the GLOBAL_HISTORY keyword in the configuration file. The default value set up in the sample configuration file is the local spool directory.

Producing accounting reports

You can produce three types of reports using either the local or global history file: the short, long, and extended versions. As their names imply, the short version of the report is a brief listing of the resources used by LoadLeveler jobs; the long version provides more comprehensive detail with summarized resource usage; and the extended version provides the comprehensive detail with detailed resource usage. If you do not specify a report type, you will receive the default short version.

The short report displays the number of jobs along with the total CPU usage according to user, class, group, and account number. The extended version of the report displays all of the data collected for every job.
v For examples of the short and extended versions of the report, see "llsummary - Return job resource information for accounting" on page 535.
v For information on the accounting APIs, refer to Chapter 17, "Application programming interfaces (APIs)," on page 541.

Correlating AIX and LoadLeveler accounting records

For jobs running on AIX systems, you can use a job accounting key to correlate AIX accounting records with LoadLeveler accounting records.

The job accounting key uniquely identifies each job step. LoadLeveler derives this key from the job key and the date and time at which the job entered the queue
(see the QDate variable description). The key is associated with the starter process for the job step and any of its child processes.

For checkpointed jobs, LoadLeveler does not change the job accounting key, regardless of how it restarts the job step. Jobs restarted from a checkpoint file or through a new job step retain the job accounting key of the original job step.

To access the job accounting key for a job step, you can use the following interfaces:
v The llsummary command, requesting the long version of the report. For details about using this command, see "llsummary - Return job resource information for accounting" on page 535.
v The GetHistory subroutine. For details about using this subroutine, see "GetHistory subroutine" on page 545.
v The ll_get_data subroutine, through the LL_StepAcctKey specification. For details about using this subroutine, see "ll_get_data subroutine" on page 570.

For information about AIX accounting records, see the system accounting topic in AIX System Management Guide: Operating System and Devices.

64-bit support for accounting functions

LoadLeveler 64-bit support for accounting functions includes:
v Statistics of jobs such as usage, limits, consumable resources, and other 64-bit integer data are preserved in the history file as rusage64 and rlimit64 structures and as data items of type int64_t.
v The LL_job_step structure defined in llapi.h allows access to the 64-bit data items either as data of type int64_t or as data of type int32_t. In the latter case, the returned values may be truncated.
v The llsummary command displays 64-bit information where appropriate.
v The data access API supports both 64-bit and 32-bit access to accounting and usage information in a history file.

See "Examples of using the data access API" on page 633 for an example of how to use the ll_get_data subroutine to access information stored in a LoadLeveler history file.

Example: Setting up job accounting files

You can perform all of the steps included in this sample procedure, or just the ones that apply to your situation. The sample procedure shown in Table 16 walks you through the process of collecting account data.

1. Edit the configuration file according to the following table:

   Table 16. Collecting account data - modifying the configuration file

   Edit this keyword:   To:
   ACCT                 Turn accounting and account validation on and off
                        and specify detailed accounting.
   ACCT_VALIDATION      Specify the account validation routine.
   GLOBAL_HISTORY       Specify a directory in which to place the global
                        history files.
2. Specify account numbers and set up account validation by performing the following steps:
   a. Specify the list of account numbers a user may use when submitting jobs, by using the account keyword in the user stanza in the administration file.
   b. Instruct users to associate an account number with their jobs, by using the account_no keyword in the job command file.
   c. Specify the ACCT_VALIDATION keyword in the configuration file to identify the module that will be called to perform account validation. The default module is called llacctval. You can replace this module with your installation's own accounting routine by specifying a new module with this keyword.
3. Specify machines and their weights by using the speed keyword in each machine's machine stanza in the administration file. Also, if your cluster contains machines of differing speeds and you want LoadLeveler accounting information to be normalized for these differences, specify cpu_speed_scale=true in each machine's respective machine stanza. (Sample administration file stanzas for steps 2a and 3 follow this procedure.)
   For example, suppose you have a cluster of two machines, called A and B, where Machine B is three times as fast as Machine A: Machine A has speed=1.0 and Machine B has speed=3.0. Suppose a job runs for 12 CPU seconds on Machine A, while the same job runs for 4 CPU seconds on Machine B. When you specify cpu_speed_scale=true, the accounting information collected on Machine B for that job shows the normalized value of 12 CPU seconds rather than the actual 4 CPU seconds.
4. Merge the files collected from each machine into one file, using the llacctmrg command.
5. Report job information on all the jobs in the history file, using the llsummary command.
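The following administration file fragment sketches steps 2a and 3; the user name, machine names, and account numbers are hypothetical:

brad:      type = user
           account = dept17 dept42

machine_a: type = machine
           speed = 1.0
           cpu_speed_scale = true

machine_b: type = machine
           speed = 3.0
           cpu_speed_scale = true

For step 2b, a user's job command file would then carry a matching line such as:

# @ account_no = dept17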
Managing job status through control expressions

You can control running jobs by using five control functions as Boolean expressions in the configuration file. These functions are useful primarily for serial jobs. You define the expressions, using normal C conventions, with the following functions:
v START
v SUSPEND
v CONTINUE
v VACATE
v KILL

The expressions are evaluated for each job running on a machine, using both the job and machine attributes. Some jobs running on a machine may be suspended while others are allowed to continue.

The START expression is evaluated twice: once to see if the machine can accept jobs to run, and a second time to see if the specific job can be run on the machine. The other expressions are evaluated after the jobs have been dispatched and, in some cases, are already running.

When evaluating the START expression to determine whether the machine can accept jobs, Class != "Z" evaluates to true only if Z is not in the class definition. This means that if two different classes are defined on a machine, Class != "Z" (where Z is one of the defined classes) always evaluates to false when specified in the START expression, and therefore the machine will not be considered available to start jobs.

Typically, machine load average, keyboard activity, time intervals, and job class are used within these various expressions to dynamically control job execution. An example of typical expressions appears after Figure 10.

For additional information about:
v Time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.
v Coding these control expressions in the configuration file, see Chapter 12, "Configuration file reference," on page 263.

How control expressions affect jobs

After LoadLeveler selects a job for execution, the job can be in any of several states. Figure 10 on page 70 shows how the control expressions can affect the state a job is in. The rectangles represent job or daemon states (Idle, Completed, Running, Suspended, and Vacating) and the diamonds represent the control expressions (Start, Suspend, Continue, Vacate, and Kill).
Figure 10. How control expressions affect jobs (flowchart; states: Idle, Completed, Running, Suspended, and Vacating; decision points: Start, Suspend, Continue, Vacate, and Kill)

Criteria used to determine when a LoadLeveler job will enter the Start, Suspend, Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files, and they can be different for each machine in the cluster. They can be modified to meet local requirements.
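The following configuration file fragment sketches typical control expressions using the machine load average, keyboard activity, and constant variables mentioned above; the thresholds are illustrative only and should be tuned to local requirements (see Chapter 12, "Configuration file reference," on page 263 for the exact syntax and variable names):

START    : (LoadAvg <= 3.00)
SUSPEND  : (LoadAvg > 3.00) && (KeyboardIdle < 300)
CONTINUE : (LoadAvg <= 1.00) && (KeyboardIdle > 300)
VACATE   : F
KILL     : F

With these settings, a machine stops accepting jobs when its load average climbs above 3.00, suspends running jobs while an interactive user is active, and resumes them once the machine is idle again; jobs are never vacated or killed by these expressions.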
Tracking job processes

When a job terminates, its orphaned processes may continue to consume or hold resources, thereby degrading system performance or causing jobs to hang or fail.

Process tracking allows LoadLeveler to cancel any processes (throughout the entire cluster) left behind when a job terminates. Process tracking is required to perform preemption by the suspend method when running either the BACKFILL or API scheduler. Process tracking is optional in all other cases.

When process tracking is enabled, all child processes are terminated when the main process terminates. This includes any background or orphaned processes started in the prolog, epilog, user prolog, and user epilog.

Process tracking on LoadLeveler for Linux is supported only on RHEL 5 and SLES 10 systems.

Two keywords are used to specify process tracking:

PROCESS_TRACKING
  To activate process tracking, set PROCESS_TRACKING=TRUE in the LoadLeveler global configuration file. By default, PROCESS_TRACKING is set to FALSE.

PROCESS_TRACKING_EXTENSION
  On AIX, this keyword specifies the path to the loadable kernel module LoadL_pt_ke in the local or global configuration file. If the PROCESS_TRACKING_EXTENSION keyword is not supplied, LoadLeveler will search the default directory $HOME/bin.
  On Linux, this keyword specifies the path to the loadable kernel module proctrk.ko in the local or global configuration file. The proctrk.ko kernel module is shipped as source code and must be built and installed on all machines where process tracking is required.

See the TWS LoadLeveler: Installation Guide for additional information about which directory to specify when using the PROCESS_TRACKING_EXTENSION configuration keyword.

The process tracking kernel extension is not unloaded when the startd daemon terminates. Therefore, if a mismatch between the version of the loaded kernel extension and the version of the installed kernel extension is found when the startd starts up, the daemon will exit. In this case, a reboot of the node is needed to unload the currently loaded kernel extension. If you install a new version of LoadLeveler that contains a new version of the kernel extension, you may need to reboot the node.

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.
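For example, a global configuration file on AIX might enable process tracking as follows; the directory shown is hypothetical, so use the directory given in the TWS LoadLeveler: Installation Guide for your installation:

PROCESS_TRACKING = TRUE
PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin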
base_name.cfg or LOADL_CONFIG=base_name. When you use the form LOADL_CONFIG=base_name, the prefix /etc and suffix .cfg are appended to the base_name.

The following example explains how you can set up a machine to query multiple clusters. You can configure /etc/LoadL.cfg to point to the configuration files for the "default" cluster, and you can configure /etc/othercluster.cfg to point to the configuration files of another cluster which the user can select.

For example, you can enter the following query command:

$ llq

The llq command uses the configuration from /etc/LoadL.cfg and queries job information from the "default" cluster. To send a query to the cluster defined in the configuration file /etc/othercluster.cfg, enter:

$ env LOADL_CONFIG=othercluster llq

Note that the machine from which you issue the llq command is considered a submit-only machine by the other cluster.

Handling switch-table errors

You may use the following configuration file keywords to control how LoadLeveler responds to switch-table errors:
v ACTION_ON_SWITCH_TABLE_ERROR
v DRAIN_ON_SWITCH_TABLE_ERROR
v RESUME_ON_SWITCH_TABLE_ERROR_CLEAR

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Providing additional job-processing controls through installation exits

LoadLeveler allows administrators to further configure the environment through installation exits. Table 17 lists these additional job-processing controls.

Table 17. Roadmap of administrator tasks accomplished through installation exits

To learn about:  Writing a program to control when jobs are scheduled to run
Read:            "Controlling the central manager scheduling cycle" on page 73

To learn about:  Writing a pair of programs to override the default LoadLeveler DCE authentication method
Read:            "Handling DCE security credentials" on page 74

To learn about:  Writing a program to refresh an AFS token when a job starts
Read:            "Handling an AFS token" on page 75
To learn about:  Writing a program to check or modify job requests when they are submitted
Read:            "Filtering a job script" on page 76

To learn about:  Writing programs to run before and after job requests
Read:            "Writing prolog and epilog programs" on page 77

To learn about:  Overriding the LoadLeveler default mail notification method
Read:            "Using your own mail program" on page 81

To learn about:  Defining a cluster metric to determine where a remote job is distributed
Read:            The CLUSTER_METRIC configuration keyword description in Chapter 12, "Configuration file reference," on page 263

To learn about:  Defining a cluster user mapper for a multicluster environment
Read:            The CLUSTER_USER_MAPPER configuration keyword description in Chapter 12, "Configuration file reference," on page 263

To learn about:  Correctly specifying configuration file keywords
Read:            Chapter 12, "Configuration file reference," on page 263

Controlling the central manager scheduling cycle

To determine when to run the LoadLeveler scheduling algorithm, the central manager uses the values set in the configuration file for the NEGOTIATOR_INTERVAL and NEGOTIATOR_CYCLE_DELAY keywords.

The central manager runs the scheduling algorithm every NEGOTIATOR_INTERVAL seconds, unless some event takes place, such as the completion of a job or the addition of a machine to the cluster. In such cases, the scheduling algorithm is run immediately. When NEGOTIATOR_CYCLE_DELAY is set, a minimum of NEGOTIATOR_CYCLE_DELAY seconds will pass between the central manager's scheduling attempts, regardless of what other events might take place.

When NEGOTIATOR_INTERVAL is set to zero, the central manager will not run the scheduling algorithm until instructed to do so by an authorized process. This setting enables your program to control the central manager's scheduling activity through one of the following:
v The llrunscheduler command
v The ll_run_scheduler subroutine

Both the command and the subroutine instruct the central manager to run the scheduling algorithm.

You might choose to use this setting if, for example, you want to write a program that directly controls the assignment of the system priority for all LoadLeveler jobs. In this particular case, you would complete the following steps to control system priority assignment and the scheduling cycle:
1. Decide the following:
   v Which system priority value to assign to jobs from specific sources or with specific resource requirements.
   v How often the central manager should run the scheduling algorithm. Your program has to be designed to issue the ll_run_scheduler subroutine at regular intervals; otherwise, LoadLeveler will not attempt to schedule any job steps.
   You also need to understand how changing the system priority affects the job queue. After you successfully use the ll_modify subroutine or the llmodify command to change system priority values, LoadLeveler will not readjust the values for those job steps when the negotiator recalculates priorities at regular
intervals set through the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword. Also, you can change the system priority for jobs only when those jobs are in the Idle state or a state similar to it. To determine which job states are similar to the Idle state or to the Running state, see the table in "LoadLeveler job states" on page 19.
2. Code a program that uses the LoadLeveler APIs to perform the following functions:
   a. Use the data access API to obtain data about all jobs.
   b. Determine whether jobs have been added or removed.
   c. Use the ll_modify subroutine to set the system priority for the LoadLeveler jobs. The values you set through this subroutine will not be readjusted when the negotiator recalculates job step priorities.
   d. Use the ll_run_scheduler subroutine to instruct the central manager to run the scheduling algorithm.
   e. Set a timer for the scheduling interval, to repeat the scheduling instruction at regular intervals. This step is required to replace the effect of setting the configuration keyword NEGOTIATOR_CYCLE_DELAY, which LoadLeveler ignores when NEGOTIATOR_INTERVAL is set to zero.
3. In the configuration file, set values for the following keywords (a sketch follows this procedure):
   v Set the NEGOTIATOR_INTERVAL keyword to zero to stop the central manager from automatically recalculating system priorities for jobs.
   v (Optional) Set the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword to specify a threshold value. If the system priority assigned to a job step is less than this threshold value, the job will remain idle.
4. Issue the llctl command with either the reconfig or recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the configuration file.
5. (Optional) To make sure that the central manager's automatic scheduling activity has been disabled (by setting the NEGOTIATOR_INTERVAL keyword to zero), use the llstatus command.
6. Run your program under a user ID with administrator authority.

Once this procedure is complete, you might want to use one or more of the following commands to make sure that jobs are scheduled according to the correct system priority. The value of q_sysprio in the command output indicates the system priority for the job step.
v Use the command llq -s to detect whether a job step is idle because its system priority is below the value set for the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword.
v Use the command llq -l to display the previous system priority for a job step.
v When unusual circumstances require you to change system priorities manually:
  1. Use the command llmodify -s to set the system priority for LoadLeveler jobs. The values you set through this command will not be readjusted when the negotiator recalculates job step priorities.
  2. Use the llrunscheduler command to instruct the central manager to run the scheduling algorithm.
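A minimal configuration sketch for step 3 follows; the threshold value of 100 is illustrative only:

NEGOTIATOR_INTERVAL = 0
SYSPRIO_THRESHOLD_TO_IGNORE_STEP = 100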
Handling DCE security credentials

You can write a pair of programs to override the default LoadLeveler DCE authentication method. To enable the programs, use the DCE_AUTHENTICATION_PAIR keyword in your configuration file. As an alternative, you can also specify the program pair:

DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce

Specifying the DCE_AUTHENTICATION_PAIR keyword enables LoadLeveler support for forwarding DCE credentials to LoadLeveler jobs. You may override the default function provided by LoadLeveler to establish DCE credentials by substituting your own programs.

Using the alternative program pair llgetdce and llsetdce:

The program pair llgetdce and llsetdce forwards DCE credentials by copying credential cache files from the submitting machine to the executing machines. While this technique may require less overhead, it has been known to produce credentials on the executing machines that are not fully capable of being forwarded by rsh commands. This is the only pair of programs offered in earlier releases of LoadLeveler.

Forwarding DCE credentials:

An example of a credentials object is a character string containing the DCE principal name and a password.

program1 writes the following to standard output:
v The length of the handle to follow
v The handle

If program1 encounters errors, it writes error messages to standard error.

program2 receives the following as standard input:
v The length of the handle to follow
v The same handle written by program1

program2 writes the following to standard output:
v The length of the login context to follow
v An exportable DCE login context, which is the idl_byte array produced from the sec_login_export_context DCE API call. For more information, see the DCE Security Services API chapter in the Distributed Computing Environment for AIX: Application Development Reference.
v A character string suitable for assigning to the KRB5CCNAME environment variable. This string represents the location of the credentials cache established in order for program2 to export the DCE login context.

If program2 encounters errors, it writes error messages to standard error. The parent process, the LoadLeveler starter process, writes those messages to the starter log.

For examples of programs that enable DCE security credentials, see the samples/lldce subdirectory in the release directory.

Handling an AFS token

You can write a program, run by the scheduler, to refresh an AFS token when a job is started. To invoke the program, use the AFS_GETNEWTOKEN keyword in your configuration file.
Before running the program, LoadLeveler sets up standard input and standard output as pipes between the program and LoadLeveler. LoadLeveler also sets up the following environment variables:

LOADL_STEP_OWNER
  The owner (UNIX user name) of the job.
LOADL_STEP_COMMAND
  The name of the command the user's job step invokes.
LOADL_STEP_CLASS
  The class in which this job step will run.
LOADL_STEP_ID
  The step identifier, generated by LoadLeveler.
LOADL_JOB_CPU_LIMIT
  The number of CPU seconds the job is limited to.
LOADL_WALL_LIMIT
  The number of wall clock seconds the job is limited to.

LoadLeveler writes the following current AFS credentials, in order, over the standard input pipe:
v The ktc_principal structure indicating the service
v The ktc_principal structure indicating the client
v The ktc_token structure containing the credentials

The ktc_principal structure is defined in the AFS header file afs_rxkad.h. The ktc_token structure is defined in the AFS header file afs_auth.h.

LoadLeveler expects to read these same structures, in the same order, from the standard output pipe, except that these should be the refreshed credentials produced by the installation exit. The installation exit can modify the passed credentials (to extend their lifetime) and pass them back, or it can obtain new credentials. LoadLeveler takes whatever is returned and uses it to authenticate the user prior to starting the user's job.

Filtering a job script

You can write a program to filter a job script when the job is submitted to the local cluster and when the job is submitted from a remote cluster. This program can, for example, modify defaults or perform site-specific verification of parameters.

To invoke the local job filter, specify the SUBMIT_FILTER keyword in your configuration file. To invoke the remote job filter, specify the CLUSTER_REMOTE_JOB_FILTER keyword in your configuration file. For more information on these keywords, see the SUBMIT_FILTER or CLUSTER_REMOTE_JOB_FILTER keyword in Chapter 12, "Configuration file reference," on page 263.

LoadLeveler sets the following environment variables when the program is invoked:

LOADL_ACTIVE
  The LoadLeveler version.
LOADL_STEP_COMMAND
  The job command file name.
LOADL_STEP_ID
  The job identifier, generated by LoadLeveler.
LOADL_STEP_OWNER
  The owner (UNIX user name) of the job.

For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.
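The following Korn shell sketch illustrates one possible local submit filter. It assumes, as is conventional for such filters, that the job command file arrives on standard input and that the (possibly modified) file must be written to standard output; the log file path is hypothetical:

#!/bin/ksh
# Hypothetical submit filter: log each submission, then pass the
# job command file through unchanged.
echo "$(date): job file $LOADL_STEP_COMMAND submitted by $LOADL_STEP_OWNER" \
    >> /var/tmp/submit_filter.log
# Write the job command file to standard output unmodified.
cat
exit 0

To enable it, point the SUBMIT_FILTER keyword at the script, for example:

SUBMIT_FILTER = /u/loadl/bin/submit_filter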
Writing prolog and epilog programs

An administrator can write prolog and epilog installation exits that run before and after a LoadLeveler job runs, respectively.

Prolog and epilog programs fall into two types:
v Those that run under the LoadLeveler user ID.
v Those that run in a user's environment.

Depending on the type of processing you want to perform before or after a job runs, specify one or more of the following configuration file keywords, in any combination:
v To run a prolog or epilog program under the LoadLeveler user ID, specify JOB_PROLOG or JOB_EPILOG, respectively.
v To run a prolog or epilog program under the user's environment, specify JOB_USER_PROLOG or JOB_USER_EPILOG, respectively.

You do not have to provide a prolog/epilog pair of programs. You may, for example, use only a prolog program that runs under the LoadLeveler user ID.

For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.

Note: If process tracking is enabled and your prolog or epilog program invokes the mailx command, set the sendwait variable to prevent the background mail process from being killed by process tracking.

A user environment prolog or epilog runs with AFS authentication if installed and enabled. For security reasons, you must code these programs on the machines where the job runs and on the machine that schedules the job. If you do not define a value for these keywords, the user environment prolog and epilog settings on the executing machine are ignored.

The user environment prolog and epilog can set environment variables for the job by sending information to standard output in the following format:

env id = value

Where:
id     Is the name of the environment variable
value  Is the value (setting) of the environment variable

For example, the following user environment prolog sets the environment variable STAGE_HOST for the job:

#!/bin/sh
echo env STAGE_HOST=shd22

Coding conventions for prolog programs

The prolog program is invoked by the starter process. Once the starter process invokes the prolog program, the program obtains information about the job from environment variables.

Syntax:
  prolog_program

Where prolog_program is the name of the prolog program as defined by the JOB_PROLOG keyword.

No arguments are passed to the program, but several environment variables are set. For more information on these environment variables, see "Run-time environment variables" on page 400.

The real and effective user ID of the prolog process is the LoadLeveler user ID. If the prolog program requires root authority, the administrator must write a secure C or Perl program to perform the desired actions. You should not use shell scripts with set-uid permissions, since these scripts may make your system susceptible to security problems.

Return code values:

0  The job will begin.

A nonzero return code prevents the job step from running, as the samples below illustrate. If the prolog program is ended with a signal, the job does not begin and a message is written to the starter log.

Sample prolog programs:

v Sample of a prolog program for Korn shell:

#!/bin/ksh
#
# Set up environment
set -a
. /etc/environment
. /.profile
export PATH="$PATH:/loctools/lladmin/bin"
export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.prolog"
#
# Do set up based upon job step class
#
case $LOADL_STEP_CLASS in
  # An OSL job is about to run; make sure the osl filesystem is
  # mounted. If the status is nonzero, the filesystem cannot be
  # mounted and the job step should not run.
  "OSL") mount_osl_files >> $LOG
         if [ $? -ne 0 ]
         then EXIT_CODE=1
         else EXIT_CODE=0
         fi
         ;;
  # A simulation job is about to run; simulation data has to
  # be made available to the job. The status from the copy script
  # must be zero or the job step cannot run.
  "sim") copy_sim_data >> $LOG
         if [ $? -eq 0 ]
         then EXIT_CODE=0
         else EXIT_CODE=1
         fi
         ;;
  # All other jobs require free space in /tmp; make sure
  # enough space is available.
  *) check_tmp >> $LOG
     EXIT_CODE=$?
     ;;
esac
# The job step will run only if EXIT_CODE == 0
exit $EXIT_CODE

v Sample of a prolog program for C shell:

#!/bin/csh
#
# Set up environment
source /u/loadl/.login
#
setenv PATH "${PATH}:/loctools/lladmin/bin"
setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.prolog"
#
# Do set up based upon job step class
#
switch ($LOADL_STEP_CLASS)
  # An OSL job is about to run; make sure the osl filesystem is
  # mounted. If the status is nonzero, the filesystem cannot be
  # mounted and the job step should not run.
  case "OSL":
    mount_osl_files >> $LOG
    if ($status != 0) then
      set EXIT_CODE = 1
    else
      set EXIT_CODE = 0
    endif
    breaksw
  # A simulation job is about to run; simulation data has to
  # be made available to the job. The status from the copy script
  # must be zero or the job step cannot run.
  case "sim":
    copy_sim_data >> $LOG
    if ($status == 0) then
      set EXIT_CODE = 0
    else
      set EXIT_CODE = 1
    endif
    breaksw
  # All other jobs require free space in /tmp; make sure
  # enough space is available.
  default:
    check_tmp >> $LOG
    set EXIT_CODE = $status
    breaksw
endsw
# The job step will run only if EXIT_CODE == 0
exit $EXIT_CODE

Coding conventions for epilog programs

The installation-defined epilog program is invoked after a job step has completed. The purpose of the epilog program is to perform any required cleanup, such as unmounting file systems, removing files, and copying results. The exit status of both the prolog program and the job step is set in environment variables.

Syntax:

  epilog_program

Where epilog_program is the name of the epilog program as defined by the JOB_EPILOG keyword.
No arguments are passed to the program, but several environment variables are set. These environment variables are described in "Run-time environment variables" on page 400. In addition, the following environment variables are set for the epilog programs:

LOADL_PROLOG_EXIT_CODE
  The exit code from the prolog program. This environment variable is set only if a prolog program is configured to run.
LOADL_USER_PROLOG_EXIT_CODE
  The exit code from the user prolog program. This environment variable is set only if a user prolog program is configured to run.
LOADL_JOB_STEP_EXIT_CODE
  The exit code from the job step.

Note: To interpret the exit status of the prolog program and the job step, convert the string to an integer and use the macros found in the sys/wait.h file. These macros include:
v WEXITSTATUS: gives you the exit code
v WTERMSIG: gives you the signal that terminated the program
v WIFEXITED: tells you if the program exited
v WIFSIGNALED: tells you if the program was terminated by a signal

The exit codes returned by the WEXITSTATUS macro are the valid codes. However, if you look at the raw numbers in sys/wait.h, the exit code may appear to be 256 times the expected return code. The numbers in sys/wait.h are those used by the wait3 system call.

Sample epilog programs:

v Sample of an epilog program for Korn shell:

#!/bin/ksh
#
# Set up environment
set -a
. /etc/environment
. /.profile
export PATH="$PATH:/loctools/lladmin/bin"
export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.epilog"
#
if [[ -z $LOADL_PROLOG_EXIT_CODE ]]
then echo "Prolog did not run" >> $LOG
else echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
fi
#
if [[ -z $LOADL_USER_PROLOG_EXIT_CODE ]]
then echo "User environment prolog did not run" >> $LOG
else echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
fi
#
if [[ -z $LOADL_JOB_STEP_EXIT_CODE ]]
then echo "Job step did not run" >> $LOG
else echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
fi
#
# Do clean up based upon job step class
#
case $LOADL_STEP_CLASS in
  # An OSL job just ran; unmount the filesystem.
  "OSL") umount_osl_files >> $LOG
         ;;
  # A simulation job just ran; remove input files.
  # Copy results if the simulation was successful (the
  # LOADL_JOB_STEP_EXIT_CODE variable contains the exit status
  # from the job step).
  "sim") rm_sim_data >> $LOG
         if [ $LOADL_JOB_STEP_EXIT_CODE = 0 ]
         then copy_sim_results >> $LOG
         fi
         ;;
  # Clean up /tmp
  *) clean_tmp >> $LOG
     ;;
esac

v Sample of an epilog program for C shell:

#!/bin/csh
#
# Set up environment
source /u/loadl/.login
#
setenv PATH "${PATH}:/loctools/lladmin/bin"
setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.epilog"
#
if ( ${?LOADL_PROLOG_EXIT_CODE} ) then
  echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
else
  echo "Prolog did not run" >> $LOG
endif
#
if ( ${?LOADL_USER_PROLOG_EXIT_CODE} ) then
  echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
else
  echo "User environment prolog did not run" >> $LOG
endif
#
if ( ${?LOADL_JOB_STEP_EXIT_CODE} ) then
  echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
else
  echo "Job step did not run" >> $LOG
endif
#
# Do clean up based upon job step class
#
switch ($LOADL_STEP_CLASS)
  # An OSL job just ran; unmount the filesystem.
  case "OSL":
    umount_osl_files >> $LOG
    breaksw
  # A simulation job just ran; remove input files.
  # Copy results if the simulation was successful (the
  # LOADL_JOB_STEP_EXIT_CODE variable contains the exit status
  # from the job step).
  case "sim":
    rm_sim_data >> $LOG
    if ($LOADL_JOB_STEP_EXIT_CODE == 0) then
      copy_sim_results >> $LOG
    endif
    breaksw
  # Clean up /tmp
  default:
    clean_tmp >> $LOG
    breaksw
endsw

Using your own mail program

You can write a program to override the LoadLeveler default mail notification method. You can use this program, for example, to display your own messages to users when a job completes, or to automate tasks such as sending error messages to a network manager.
The syntax for the program is the same as it is for standard UNIX mail programs; the command is called with the following arguments:
v -s to indicate a subject
v A pointer to a string containing the subject
v A pointer to a string containing a list of mail recipients

The mail message is taken from standard input.

To enable this program to replace the default mail notification method, use the MAIL keyword in the configuration file. For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.
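A minimal Korn shell sketch of such a program follows; the script path is hypothetical, and the sketch assumes the arguments arrive as -s, then the subject string, then the recipient list, as described above:

#!/bin/ksh
# Hypothetical mail exit: prepend a site banner to every LoadLeveler
# notification, then hand the message to the standard mail command.
# $1 is "-s", $2 is the subject, $3 is the list of recipients.
SUBJECT="$2"
RECIPIENTS="$3"
{
  echo "*** Notification from the LoadLeveler cluster ***"
  cat -            # the message body arrives on standard input
} | mail -s "$SUBJECT" $RECIPIENTS

To enable it, set the MAIL keyword in the configuration file:

MAIL = /u/loadl/bin/llmail.sh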
Chapter 5. Defining LoadLeveler resources to administer

After installing LoadLeveler, you may customize it by modifying the administration file. The administration file optionally lists and defines the machines in the LoadLeveler cluster and the characteristics of classes, users, and groups.

LoadLeveler does not prevent you from having multiple copies of administration files, but you need to be sure to update all the copies whenever you make a change to one. Having only one administration file prevents any confusion.

Table 18 lists the LoadLeveler resources you may define by modifying the administration file.

Table 18. Roadmap of tasks for modifying the LoadLeveler administration file

To learn about:  Modifying the administration file
Read:            "Steps for modifying an administration file"

To learn about:  Defining LoadLeveler resources to administer
Read:            v "Defining machines" on page 84
                 v "Defining adapters" on page 86
                 v "Defining classes" on page 89
                 v "Defining users" on page 97
                 v "Defining groups" on page 99
                 v "Defining clusters" on page 100

To learn about:  Correctly specifying administration file keywords
Read:            Chapter 13, "Administration file reference," on page 321

Steps for modifying an administration file

All LoadLeveler commands, daemons, and processes read the administration and configuration files at startup. If you change the administration or configuration files after LoadLeveler has already started, any LoadLeveler command or process, such as the LoadL_starter process, will read the newer version of the files, while the running daemons will continue to use the data from the older version. To ensure that all LoadLeveler commands, daemons, and processes use the same configuration data, run the reconfiguration command on all machines in the cluster each time the administration or configuration files are changed.

Before you begin, you need to:
v Ensure that the installation procedure has completed successfully and that the administration file, LoadL_admin, exists in LoadLeveler's home directory. For additional details about installation, see TWS LoadLeveler: Installation Guide.
v Know how to correctly specify keywords in the administration file. For information about administration file keyword syntax and other details, see Chapter 13, "Administration file reference," on page 321.
v (Optional) Know how to correctly issue the llextRPD command, if you choose
  to use it (see “llextRPD - Extract data from an RSCT peer domain” on page
  443).

Perform the following steps to modify the administration file, LoadL_admin:
1. Identify yourself as a LoadLeveler administrator using the LOADL_ADMIN
   keyword.
2. Provide the following stanza types in the administration file:
   v One machine stanza to define the central manager for the LoadLeveler
     cluster. You also may create machine stanzas for other machines in the
     LoadLeveler cluster. You can use the llextRPD command to automatically
     create machine stanzas.
   v (Optional) An adapter stanza for each type of network adapter that you
     want LoadLeveler jobs to be able to request. You can use the llextRPD
     command to automatically create adapter stanzas.
3. (Optional) Specify one or more of the following stanza types:
   v A class stanza for each set of LoadLeveler jobs that have similar
     characteristics or resource requirements.
   v A user stanza for specific users, if their requirements do not match
     those characteristics defined in the default user stanza.
   v A group stanza for each set of LoadLeveler users that have similar
     characteristics or resource requirements.
4. (Optional) You may specify values for additional administration file
   keywords, which are listed and described in “Administration file keyword
   descriptions” on page 327.
5. Notify LoadLeveler daemons by issuing the llctl command with either the
   reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
   modifications you made to the administration file.

Defining machines

The information in a machine stanza defines the characteristics of that
machine.

You do not have to specify a machine stanza for every machine in the
LoadLeveler cluster, but you must have one machine stanza for the machine that
will serve as the central manager. If you do not specify a machine stanza for
a machine in the cluster, the machine and the central manager still
communicate and jobs are scheduled on the machine, but the machine is assigned
the default values specified in the default machine stanza. If there is no
default stanza, the machine is assigned default values set by LoadLeveler.

Any machine name used in the stanza must be a name which can be resolved to an
IP address. This name is referred to as an interface name because the name can
be used for a program to interface with the machine. Generally, interface
names match the machine name, but they do not have to.

By default, LoadLeveler will append the DNS domain name to the end of any
machine name without a domain name appended before resolving its address. If
you specify a machine name without a domain name appended to it and you do not
want LoadLeveler to append the DNS domain name to it, specify the name using a
trailing period. You may have a need to specify machine names in this way if
you are running a cluster with more than one nameserving technique.
For example, if you are using a DNS nameserver and running NIS, you may have
some machine names, resolved by NIS, to which you do not want LoadLeveler to
append DNS names. In situations such as this, you also want to specify the
name_server keyword in your machine stanzas.

Under the following conditions, you must have a machine stanza for the machine
in question:
v If you set the MACHINE_AUTHENTICATE keyword to true in the configuration
  file, then you must create a machine stanza for each node that LoadLeveler
  includes in the cluster.
v If the machine’s hostname (the name of the machine returned by the UNIX
  hostname command) does not match an interface name. In this case, you must
  specify the interface name as the machine stanza name and specify the
  machine’s hostname using the alias keyword.
v If the machine’s hostname does match an interface name but not the correct
  interface name.
For information about automatically creating machine stanzas, see “llextRPD -
Extract data from an RSCT peer domain” on page 443.

Planning considerations for defining machines

There are several matters to consider before customizing the administration
file.

Before customizing the administration file, consider the following:
v Node availability
  Some workstation owners might agree to accept LoadLeveler jobs only when
  they are not using the workstation themselves. Using LoadLeveler keywords,
  these workstations can be configured to be available at designated times
  only.
v Common name space
  To run jobs on any machine in the LoadLeveler cluster, a user needs the same
  uid (the user ID number for a user) and gid (the group ID number for a
  group) on every machine in the cluster.
  For example, if there are two machines in your LoadLeveler cluster,
  machine_1 and machine_2, user john must have the same user ID and login
  group ID in the /etc/passwd file on both machines. If user john has user ID
  1234 and login group ID 100 on machine_1, then user john must have the same
  user ID and login group ID in /etc/passwd on machine_2. (LoadLeveler
  requires a job to run with the same group ID and user ID of the person who
  submitted the job.)
  If you do not have a user ID on one machine, your jobs will not run on that
  machine. Also, many commands, such as llq, will not work correctly if a user
  does not have a user ID on the central manager machine.
  However, there are cases where you may choose to not give a user a login ID
  on a particular machine. For example, a user does not need an ID on every
  submit-only machine; the user only needs to be able to submit jobs from at
  least one such machine. Also, you may choose to restrict a user’s access to
  a Schedd machine that is not a public scheduler; again, the user only needs
  access to at least one Schedd machine.
v Resource handling
  Some nodes in the LoadLeveler cluster might have special software installed
  that users might need to run their jobs successfully. You should configure
  LoadLeveler to distinguish those nodes from other nodes using, for example,
  machine features.
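Tying together the modification steps and the naming rules above, a minimal
LoadL_admin file might begin like the following sketch. All host and class
names here are invented, and the name_server value assumes NIS is one of the
accepted values for that keyword; see Chapter 13, “Administration file
reference,” on page 321 for the authoritative syntax.
# Minimal administration file sketch (hypothetical names)
node01.example.com: type = machine
                    central_manager = true   # required: the central manager

nismachine.:        type = machine           # trailing period: do not append
                    name_server = NIS        # the DNS domain; NIS resolves it

batch:              type = class             # optional class stanza (step 3)
                    wall_clock_limit = 30:00
After editing the file, issuing llctl with the reconfig keyword (step 5)
ensures the running daemons pick up the changes.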
Machine stanza format and keyword summary

Machine stanzas take the following format. Default values for keywords appear
in bold:

label: type = machine
       adapter_stanzas = stanza_list
       alias = machine_name
       central_manager = true | false | alt
       cpu_speed_scale = true | false
       machine_mode = batch | interactive | general
       master_node_exclusive = true | false
       max_jobs_scheduled = number
       name_server = list
       pool_list = pool_numbers
       reservation_permitted = true | false
       resources = name(count) name(count) ... name(count)
       schedd_fenced = true | false
       schedd_host = true | false
       speed = number
       submit_only = true | false

Figure 11. Format of a machine stanza

Examples: Machine stanzas

These machine stanza examples may apply to your situation.
v Example 1
  In this example, the machine is being defined as the central manager.
  #
  machine_a: type = machine
             central_manager = true   # central manager runs here
v Example 2
  This example sets up a submit-only node. Note that the submit-only keyword
  in the example is set to true, while the schedd_host keyword is set to
  false. You must also ensure that you set the schedd_host to true on at
  least one other node in the cluster.
  #
  machine_b: type = machine
             central_manager = false  # not the central manager
             schedd_host = false      # not a scheduling machine
             submit_only = true       # submit only machine
             alias = machineb         # interface name
v Example 3
  In the following example, machine_c is the central manager and has an alias
  associated with it:
  #
  machine_c: type = machine
             central_manager = true   # central manager runs here
             schedd_host = true       # defines a public scheduler
             alias = brianne

Defining adapters

An adapter stanza identifies network adapters that are available on the
machines in the LoadLeveler cluster.
If you want LoadLeveler jobs to be able to request specific adapters, you must
either specify adapter stanzas or configure dynamic adapters in the
administration file.

Note the following when using an adapter stanza:
v An adapter stanza is required for each adapter stanza name you specify on
  the adapter_stanzas keyword of the machine stanza.
v The adapter_name, interface_address, and interface_name keywords are
  required.
For information about creating adapter stanzas, see “llextRPD - Extract data
from an RSCT peer domain” on page 443 for peer domains.

Configuring dynamic adapters

LoadLeveler can dynamically determine the adapters in any operating system
instance (OSI) that has RSCT installed. LoadLeveler must be told on an OSI
basis if it is to handle dynamic adapter configuration changes for that OSI.
The specification of whether to use dynamic or static adapter configuration
for an OSI is done through the presence or absence of the machine stanza’s
adapter_stanzas keyword.

If a machine stanza in the administration file contains an adapter_stanzas
statement then this is taken as a directive by the LoadLeveler administrator
to use only those specified adapters. For this OSI, LoadLeveler will not
perform any dynamic adapter configuration or processing. If an adapter change
occurs in this OSI then the administrator will have to make the corresponding
change in the administration file and then stop and restart or reconfigure the
LoadLeveler startd daemon to pick up the adapter changes.

If an OSI (machine stanza) in the administration file does not contain the
adapter_stanzas keyword then this is taken as a directive by the LoadLeveler
administrator for LoadLeveler to dynamically configure the adapters for that
OSI. For that OSI, LoadLeveler will determine what adapters are present at
startup via calls to the RMC API. If an adapter change occurs during execution
in the OSI then LoadLeveler will automatically detect and handle the change
without requiring a restart or reconfiguration.
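The two directives can be contrasted in a short sketch (host and stanza names
are invented for illustration):
# Static: LoadLeveler uses only the listed adapters and performs no
# dynamic adapter processing for this OSI.
static_node:  type = machine
              adapter_stanzas = static_node_sn0 static_node_en0

# Dynamic: no adapter_stanzas keyword, so LoadLeveler discovers the
# adapters itself through the RMC API and tracks changes automatically.
dynamic_node: type = machine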
Configuring InfiniBand adapters

InfiniBand adapters, known as host channel adapters (HCAs), can be
multiported. Tasks can use ports of an HCA independently, which allows them to
be allocated by the scheduling algorithm independently.

| Note: InfiniBand adapters are supported on the AIX operating system and in
| SUSE Linux Enterprise Server (SLES) 9 and SLES 10 on TWS LoadLeveler for
| POWER clusters.

An InfiniBand adapter can have multiple adapter ports. Each port on the
InfiniBand adapter can be connected to one network and will be managed by TWS
LoadLeveler as a switch adapter. InfiniBand adapter ports derive their
resources and usage state from the InfiniBand adapter with which they are
associated, but are allocated to jobs separately.

If you want LoadLeveler jobs to be able to request InfiniBand adapters, you
must either specify adapter stanzas or configure dynamic adapters in the
administration file. The InfiniBand ports are identified to TWS LoadLeveler in
the same way other adapters are. Stanzas are specified in the administration
file if static adapters are used and the ports are discovered by RSCT if
dynamic adapters are used.

The port_number administration keyword has been added to support an InfiniBand
port. The port_number keyword specifies the port number of the InfiniBand
adapter port. Only InfiniBand ports are managed and displayed by TWS
LoadLeveler; the InfiniBand adapter itself is not. The adapter stanza for
InfiniBand support only contains the adapter port information. There is no
InfiniBand adapter information in the adapter stanza (see example 2 in
“Examples: Adapter stanzas” on page 89).

Note:
1. TWS LoadLeveler distributes the switch adapter windows of the InfiniBand
   adapter equally among its ports and the allocation is not adjusted should
   all of the resources on one port be consumed.
2. The InfiniBand ports determine their usage state and availability from
   their InfiniBand adapter. If one port is in use exclusively, no other
   ports on the adapter can be used for any other job.
3. If you have a mixed cluster where some nodes use the InfiniBand adapter
   and some nodes use the HPS adapter, you have to organize the nodes into
   pools so that the job is only dispatched to nodes with the same kind of
   switch adapter.
4. There is no change to the way the InfiniBand adapters are requested on the
   job command file network statement; that is, InfiniBand adapters are
   requested the same way as any other adapter would be.
5. Because InfiniBand adapters do not support rCxt blocks, jobs that would
   otherwise use InfiniBand adapters, but which also request rCxt blocks with
   the rcxtblks keyword on the network statement, will remain in the idle
   state. This behavior is consistent with how other adapters (for example,
   the HPS) behave in the same situation. You can use the llstatus -a command
   to see rCxt blocks on adapters (see “llstatus - Query machine status” on
   page 512 for more information).

Adapter stanza format and keyword summary

An adapter stanza has the following format:

label: type = adapter
       adapter_name = name
       adapter_type = type
       device_driver_name = name
       interface_address = IP_address
       interface_name = name
       logical_id = id
       multilink_address = ip_address
       multilink_list = adapter_name <, adapter_name>*
       network_id = id
       network_type = type
       port_number = number
       switch_node_number = integer

Figure 12. Format of an adapter stanza
Examples: Adapter stanzas

These adapter stanza examples may apply to your situation.
v Example 1: Specifying an HPS adapter
  In the following example, the adapter stanza called
  “c121s0n10.ppd.pok.ibm.com” specifies an HPS adapter. Note that
  c121s0n10.ppd.pok.ibm.com is also specified on the adapter_stanzas keyword
  of the machine stanza for the “yugo” machine.
  yugo: type=machine
        adapter_stanzas = c121s0n10.ppd.pok.ibm.com
        ...
  c121s0n10.ppd.pok.ibm.com: type = adapter
        adapter_name = sn0
        network_type = switch
        interface_address = 192.168.0.10
        interface_name = c121s0n10.ppd.pok.ibm.com
        multilink_address = 10.10.10.10
        logical_id = 2
        adapter_type = Switch_Network_Interface_For_HPS
        device_driver_name = sni0
        network_id = 1
  c121f2rp02.ppd.pok.ibm.com: type = adapter
        adapter_name = en0
        network_type = ethernet
        interface_address = 9.114.66.74
        interface_name = c121f2rp02.ppd.pok.ibm.com
        device_driver_name = ent0
v Example 2: Specifying an InfiniBand adapter
  In the following example, the port_number specifies the port number of the
  InfiniBand adapter port:
  192.168.9.58: type = adapter
        adapter_name = ib1
        network_type = InfiniBand
        interface_address = 192.168.9.58
        interface_name = 192.168.9.58
        logical_id = 23
        adapter_type = InfiniBand
        device_driver_name = ehca0
        network_id = 18338657682652659714
        port_number = 2

Defining classes

The information in a class stanza defines characteristics for that class.

These characteristics can include the quantities of consumable resources that
may be used by a class per machine or cluster. Within a class stanza, you can
have optional user substanzas that define policies that apply to a user’s job
steps that need to use this class. For more information about user substanzas,
see “Defining user substanzas in class stanzas” on page 94. For information
about user stanzas, see “Defining users” on page 97.

Using limit keywords

A limit is the amount of a resource that a job step or a process is allowed to
use. (A process is a dispatchable unit of work.) A job step may be made up of
several processes.
Limits include both a hard limit and a soft limit. When a hard limit is
exceeded, the job is usually terminated. When a soft limit is exceeded, the
job is usually given a chance to perform some recovery actions.

Limits are enforced either per process or per job step, depending on the type
of limit. For parallel job steps, which consist of multiple tasks running on
multiple machines, limits are enforced on a per task basis.

The class stanza includes the limit keywords shown in Table 19, which allow
you to control the amount of resources used by a job step or a job process.

Table 19. Types of limit keywords

Limit                       How the limit is enforced
as_limit                    Per process
ckpt_time_limit             Per job step
core_limit                  Per process
cpu_limit                   Per process
data_limit                  Per process
default_wall_clock_limit    Per job step
file_limit                  Per process
job_cpu_limit               Per job step
locks_limit                 Per process
memlock_limit               Per process
nofile_limit                Per process
nproc_limit                 Per user
rss_limit                   Per process
stack_limit                 Per process
wall_clock_limit            Per job step

For example, a common limit is the cpu_limit, which limits the amount of CPU
time a single process can use. If you set cpu_limit to five hours and you have
a job step that forks five processes, each process can use up to five hours of
CPU time, for a total of 25 CPU hours. Another limit that controls the amount
of CPU used is job_cpu_limit. For a serial job step, if you impose a
job_cpu_limit of five hours, the entire job step (made up of all five
processes) cannot consume more than five CPU hours. For information on using
this keyword with parallel jobs, see the job_cpu_limit keyword. A sketch
contrasting these two keywords appears at the end of this topic.

You can specify limits in either the class stanza of the administration file
or in the job command file. The lower of these two limits will be used to run
the job even if the system limit for the user is lower. For more information,
see:
v “Enforcing limits”
v “Administration file keyword descriptions” on page 327 or “Job command file
  keyword descriptions” on page 359
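As a concrete illustration of the per-process versus per-job-step distinction,
a class stanza might set both CPU limits. The class name and values below are
invented for this sketch:
cpu_bound: type = class
           cpu_limit = 05:00:00       # per process: each of the five
                                      # forked processes may use 5 CPU
                                      # hours (up to 25 hours in total)
           job_cpu_limit = 05:00:00   # per job step: all processes of a
                                      # serial step share 5 CPU hours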
Enforcing limits

LoadLeveler depends on the underlying operating system to enforce process
limits. Users should verify that a process limit such as rss_limit is enforced
by the operating system, otherwise setting it in LoadLeveler will have no
effect.

Exceeding job step limits: When a hard limit is exceeded LoadLeveler sends a
non-trappable signal (except in the case of a parallel job) to the process
group that LoadLeveler created for the job step. When a soft limit is
exceeded, LoadLeveler sends a trappable signal to the process group.

Any job application that intends to trap a signal sent by LoadLeveler must
ensure that all processes in the process group set up the appropriate signal
handler. All processes in the job step normally receive the signal. The
exception to this rule is when a child process creates its own process group.
That action isolates the child’s process, and its children, from any signals
that LoadLeveler sends. Any child process creating its own process group is
still known to process tracking. So, if process tracking is enabled, all the
child processes are terminated when the main process terminates.

Table 20 summarizes the actions that the LoadL_starter daemon takes when a job
step limit is exceeded.

Table 20. Enforcing job step limits

Type of Job   When a Soft Limit is Exceeded         When a Hard Limit is Exceeded
Serial        SIGXCPU or SIGKILL issued             SIGKILL issued
Parallel      SIGXCPU issued to both the user       SIGTERM issued
              program and to the parallel daemon

On systems that do not support SIGXCPU, LoadLeveler does not distinguish
between hard and soft limits. When a soft limit is reached on these platforms,
LoadLeveler issues a SIGKILL.

Enforcing per process limits: For per process limits, what happens when your
job reaches and exceeds either the soft limit or the hard limit depends on the
operating system you are using.

When a job forks a process that exceeds a per process limit, such as the CPU
limit, the operating system (not LoadLeveler) terminates the process by
issuing a SIGXCPU. As a result, you will not see an entry in the LoadLeveler
logs indicating that the process exceeded the limit. The job will complete
with a 0 return code. LoadLeveler can only report the status of any processes
it has started. If you need more specific information, refer to your operating
system documentation.

How LoadLeveler uses hard limits: Consider these details on how LoadLeveler
uses hard limits.
See Table 21 for more information on specifying limits.

Table 21. Setting limits

If the hard limit is:                 Then LoadLeveler does the following:
Set in both the class stanza and the  Smaller of the two limits is taken into
job command file                      consideration. If the smaller limit is
                                      the job limit, the job limit is then
                                      compared with the user limit set on the
                                      machine that runs the job. The smaller
                                      of these two values is used. If the
                                      limit used is the class limit, the class
                                      limit is used without being compared to
                                      the machine limit.
Not set in either the class stanza or User per process limit set on the
the job command file                  machine that runs the job is used.
Set in the job command file and is    The job is not submitted.
less than its respective job soft
limit
Set in the class stanza and is less   Soft limit is adjusted downward to equal
than its respective class stanza soft the hard limit.
limit
Specified in the job command file     Hard limit must be greater than or equal
                                      to the specified soft limit and less
                                      than or equal to the limit set by the
                                      administrator in the class stanza of the
                                      administration file.

Note: If the per process limit is not defined in the administration file and
the hard limit defined by the user in the job command file is greater than the
limit on the executing machine, then the hard limit is set to the machine
limit.

Allowing users to use a class

In a class stanza, you may define a list of users or a list of groups to
identify those who may use the class.

To do so, use the include_users or include_groups keyword, respectively, or
you may use both keywords. If you specify both keywords, a particular user
must satisfy both the include_users and the include_groups restrictions for
the class. This requirement means that a particular user must be defined not
only in a User stanza in the administration file, but also in one of the
following ways:
v The user’s name must appear in the include_users keyword in a Group stanza
  whose name corresponds to a name in the include_groups keyword of the Class
  stanza.
v The user’s name must appear in the include_groups keyword of the Class
  stanza.
For information about specifying a user name in a group list, see the
include_groups keyword description in “Administration file keyword
descriptions” on page 327.

Class stanza format and keyword summary

Class stanzas are optional.

Class stanzas take the following format. Default values for keywords appear in
bold.
label: type = class
       admin = list
       allow_scale_across_jobs = true | false
       as_limit = hardlimit,softlimit
       ckpt_dir = directory
       ckpt_time_limit = hardlimit,softlimit
       class_comment = "string"
       core_limit = hardlimit,softlimit
       cpu_limit = hardlimit,softlimit
       data_limit = hardlimit,softlimit
       default_resources = name(count) name(count)...name(count)
       default_node_resources = name(count) name(count)...name(count)
       env_copy = all | master
       exclude_bg = list
       exclude_groups = list
       exclude_users = list
       file_limit = hardlimit,softlimit
       include_bg = list
       include_groups = list
       include_users = list
       job_cpu_limit = hardlimit,softlimit
       locks_limit = hardlimit,softlimit
       master_node_requirement = true | false
       max_node = number
       max_protocol_instances = number
       max_top_dogs = number
       max_total_tasks = number
       maxjobs = number
       memlock_limit = hardlimit,softlimit
       nice = value
       nofile_limit = hardlimit,softlimit
       nproc_limit = hardlimit,softlimit
       priority = number
       rss_limit = hardlimit,softlimit
       smt = yes | no | as_is
       stack_limit = hardlimit,softlimit
|      striping_with_minimum_networks = true | false
       total_tasks = number
       wall_clock_limit = hardlimit,softlimit
       default_wall_clock_limit = hardlimit,softlimit

Figure 13. Format of a class stanza

Examples: Class stanzas

Any of the following class stanza examples may apply to your situation.
v Example 1: Creating a class that excludes certain users
  class_a: type=class             # class that excludes users
           priority=10            # ClassSysprio
           exclude_users=green judy  # Excluded users
v Example 2: Creating a class for small-size jobs
  small: type=class               # class for small jobs
         priority=80              # ClassSysprio (max=100)
         cpu_limit=00:02:00       # 2 minute limit
         data_limit=30mb          # max 30 MB data segment
         default_resources=ConsumableVirtualMemory(10mb)  # resources consumed by each
         ConsumableCpus(1) resA(3) floatinglicenseX(1)    # task of a small job step if
                                  # resources are not explicitly
                                  # specified in the job command file
         ckpt_time_limit=3:00,2:00  # 3 minute hardlimit,
                                  # 2 minute softlimit
         core_limit=10mb          # max 10 MB core file
         file_limit=50mb          # max file size 50 MB
         stack_limit=10mb         # max stack size 10 MB
         rss_limit=35mb           # max resident set size 35 MB
         include_users = bob sally  # authorized users
v Example 3: Creating a class for medium-size jobs
  medium: type=class              # class for medium jobs
          priority=70             # ClassSysprio
          cpu_limit=00:10:00      # 10 minute run time limit
          data_limit=80mb,60mb    # max 80 MB data segment
                                  # soft limit 60 MB data segment
          ckpt_time_limit=5:00,4:30  # 5 minute hardlimit,
                                  # 4 minute 30 second softlimit to checkpoint
          core_limit=30mb         # max 30 MB core file
          file_limit=80mb         # max file size 80 MB
          stack_limit=30mb        # max stack size 30 MB
          rss_limit=100mb         # max resident set size 100 MB
          job_cpu_limit=1800,1200 # hard limit is 30 minutes,
                                  # soft limit is 20 minutes
v Example 4: Creating a class for large-size jobs
  large: type=class               # class for large jobs
         priority=60              # ClassSysprio
         cpu_limit=00:10:00       # 10 minute run time limit
         data_limit=120mb         # max 120 MB data segment
         default_resources=ConsumableVirtualMemory(40mb)  # resources consumed
         ConsumableCpus(2) resA(8) floatinglicenseX(1) resB(1)  # by each task of
                                  # a large job step if resources are not
                                  # explicitly specified in the job command file
         ckpt_time_limit=7:00,5:00  # 7 minute hardlimit,
                                  # 5 minute softlimit to checkpoint
         core_limit=30mb          # max 30 MB core file
         file_limit=120mb         # max file size 120 MB
         stack_limit=unlimited    # unlimited stack size
         rss_limit=150mb          # max resident set size 150 MB
         job_cpu_limit = 3600,2700  # hard limit 60 minutes
                                  # soft limit 45 minutes
         wall_clock_limit=12:00:00,11:59:55  # hard limit is 12 hours
v Example 5: Creating a class for master node machines
  sp-6hr-sp: type=class           # class for master node machines
             priority=50          # ClassSysprio (max=100)
             ckpt_time_limit=25:00,20:00  # 25 minute hardlimit,
                                  # 20 minute softlimit to checkpoint
             cpu_limit = 06:00:00 # 6 hour limit
             job_cpu_limit = 06:00:00  # hard limit is 6 hours
             core_limit = 1mb     # max 1 MB core file
             master_node_requirement = true  # master node definition
v Example 6: Creating a class for MPICH-GM jobs
  MPICHGM: type=class             # class for MPICH-GM jobs
           default_resources = gmports(1)  # one gmports resource is consumed by each
                                  # task, if resources are not explicitly
                                  # specified in the job command file

Defining user substanzas in class stanzas

In a class stanza, you may define user substanzas using the same syntax as you
would for any stanza in the LoadLeveler administration file.

A user substanza within a class stanza defines policies that apply to job
steps submitted by that user and belonging to that class. User substanzas are
optional and are independent of user stanzas (for information about user
stanzas, see “Defining users” on page 97).
Class stanzas that contain user substanzas have the following format:

label: {
  type = class
  label: {
    type = user
    maxidle = number
    maxjobs = number
    maxqueued = number
    max_total_tasks = number
  }
}

Figure 14. Format of a user substanza

When defining substanzas within other stanzas, you must use opening and
closing braces ({ and }) to mark the beginning and the end of the stanza and
substanza.

The only keywords that are supported in a user substanza are type (required),
maxidle, maxjobs, maxqueued, and max_total_tasks. For detailed descriptions of
these keywords, see “Administration file keyword descriptions” on page 327.

Examples: Substanzas

Any of these substanza examples may apply to your situation.

In the following example, the default machine and class stanzas do not require
braces, but the parallel class stanza does require them. Without braces to
open and close the parallel stanza, it would not be clear that the default
user and dept_head user stanza belong to the parallel class:

default: type = machine
         central_manager = false
         schedd_host = true

default: type = class
         wall_clock_limit = 60:00,30:00

parallel: {
  type = class
  # Allow at most 50 running jobs for class parallel
  maxjobs = 50
  # Allow at most 10 running jobs for any single
  # user of class parallel
  default: {
    type = user
    maxjobs = 10
  }
  # Allow user dept_head to run as many as 20 jobs
  # of class parallel
  dept_head: {
    type = user
    maxjobs = 20
  }
}

dept_head: type = user
           maxjobs = 30
When user substanzas are used in class stanzas, a default user substanza can
be defined. Each class stanza can have its own default user substanza, and
even the default class stanza can have a default user substanza.

In this example, the default user substanza in the default class indicates
that for any combination of class and user, the limits maxidle=20 and
maxqueued=30 apply, and that maxjobs and max_total_tasks are unlimited. Some
of these values are overridden in the physics class stanza. Here is an example
of how class stanzas can be configured:

default: {
  type = class
  default: {
    type = user
    maxidle = 20
    maxqueued = 30
    maxjobs = -1
    max_total_tasks = -1
  }
}
physics: {
  type = class
  default: {
    type = user
    maxjobs = 10
    max_total_tasks = 128
  }
  john: {
    type = user
    maxidle = 10
    maxjobs = 14
  }
  jane: {
    type = user
    max_total_tasks = 192
  }
}

In the following example, the physics stanza shows which values are inherited
from which stanzas:

physics: {
  type = class
  default: {
    type = user
    # inherited from default class, default user
    # maxidle = 20
    # inherited from default class, default user
    # maxqueued = 30
    # overrides value of -1 in default class, default user
    maxjobs = 10
    # overrides value of -1 in default class, default user
    max_total_tasks = 128
  }
  john: {
    type = user
    # overrides value of 10 in default user
    maxidle = 10
    # inherited from default user, which was inherited
    # from default class, default user
    # maxqueued = 30
    # overrides value of 10 in default user
    maxjobs = 14
    # inherited from default user
    # max_total_tasks = 128
  }
  jane: {
    type = user
    # inherited from default user, which was inherited
    # from default class, default user
    # maxidle = 20
    # inherited from default user, which was inherited
    # from default class, default user
    # maxqueued = 30
    # inherited from default user
    # maxjobs = 10
    # overrides value of 128 in default user
    max_total_tasks = 192
  }
}

Any user other than john and jane who submits jobs of class physics is subject
to the constraints in the default user substanza in the physics class stanza.
Should john or jane submit jobs of any class other than physics, they are
subject to the constraints in the default user substanza in the default class
stanza.

In addition to specifying a default user substanza within the default class
stanza, an administrator can specify other user substanzas in the default
class stanza. It is important to note that all class stanzas will inherit all
user substanzas from the default class stanza.

Note: An important rule to understand is that a user substanza within a class
stanza will inherit its values from the user substanza in the default class
stanza first, if a substanza for that user is present. The next location a
user substanza inherits values from is the default user substanza within the
same class stanza. When no default stanzas or substanzas are provided, the
LoadLeveler default for all four keywords is -1 or unlimited.

If a user substanza is provided for a user on the class exclude_users list,
exclude_users takes precedence and the user substanza will be effectively
ignored because that user cannot use the class at all. On the other hand, when
include_users is used in a class, the presence of a user substanza implies
that the user is permitted to use the class (it is as if the user were present
on the include_users list).

Defining users

The information specified in a user stanza defines the characteristics of that
user.

You can have one user stanza for each user but this is not necessary. If an
individual user does not have their own user stanza, that user uses the
defaults defined in the default user stanza.

User stanza format and keyword summary

User stanzas take a particular format.
User stanzas take the following format:

label: type = user
       account = list
       default_class = list
       default_group = group name
       default_interactive_class = class name
       env_copy = all | master
       fair_shares = number
       max_node = number
       max_reservation_duration = number
|      max_reservation_expiration = number
       max_reservations = number
       max_total_tasks = number
       maxidle = number
       maxjobs = number
       maxqueued = number
       priority = number
       total_tasks = number

Figure 15. Format of a user stanza

For more information about the keywords listed in the user stanza format, see
Chapter 13, “Administration file reference,” on page 321.

Examples: User stanzas

Any of the following user stanzas may apply to your situation.
v Example 1
  In this example, user fred is being provided with a user stanza. User
  fred’s jobs will have a user priority of 100. If user fred does not specify
  a job class in the job command file, the default job class class_a will be
  used. In addition, he can have a maximum of 15 jobs running at the same
  time.
  # Define user stanzas
  fred: type = user
        priority = 100
        default_class = class_a
        maxjobs = 15
v Example 2
  This example explains how a default interactive class for a parallel job is
  set by presenting a series of user stanzas and class stanzas. This example
  assumes that users do not specify the LOADL_INTERACTIVE_CLASS environment
  variable.
  default: type = user
           default_interactive_class = red
           default_class = blue
  carol: type = user
         default_class = single double
         default_interactive_class = ijobs
  steve: type = user
         default_class = single double
  ijobs: type = class
         wall_clock_limit = 08:00:00
  red: type = class
       wall_clock_limit = 30:00
  If the user Carol submits an interactive job, the job is assigned to the
  default interactive class called ijobs. The job is assigned a wall clock
  limit of 8 hours.
  If the user Steve submits an interactive job, the job is assigned to the
  red class from the default user stanza. The job is assigned a wall clock
  limit of 30 minutes.
v Example 3
  In this example, Jane’s jobs have a user priority of 50, and if she does
  not specify a job class in her job command file the default job class
  small_jobs is used. This user stanza does not specify the maximum number of
  jobs that Jane can run at the same time so this value defaults to the value
  defined in the default stanza. Also, suppose Jane is a member of the
  primary UNIX group “staff.” Jobs submitted by Jane will use the default
  LoadLeveler group “staff.” Lastly, Jane can use three different account
  numbers.
  # Define user stanzas
  jane: type = user
        priority = 50
        default_class = small_jobs
        default_group = Unix_Group
        account = dept10 user3 user4

Defining groups

LoadLeveler groups are another way of granting control to the system
administrator.

Although a LoadLeveler group is independent from a UNIX group, you can
configure a LoadLeveler group to have the same users as a UNIX group by using
the include_users keyword.

Group stanza format and keyword summary

The information specified in a group stanza defines the characteristics of
that group.

Group stanzas are optional and take the following format:

label: type = group
       admin = list
       env_copy = all | master
       fair_shares = number
       exclude_users = list
       include_users = list
       max_node = number
       max_reservation_duration = number
|      max_reservation_expiration = number
       max_reservations = number
       max_total_tasks = number
       maxidle = number
       maxjobs = number
       maxqueued = number
       priority = number
       total_tasks = number

Figure 16. Format of a group stanza

For more information about the keywords listed in the group stanza format, see
Chapter 13, “Administration file reference,” on page 321.

Examples: Group stanzas

Any of the following group stanzas may apply to your situation.
v Example 1
  In this example, the group name is department_a. The jobs issued by users
  belonging to this group will have a priority of 80. There are three members
  in this group.
  # Define group stanzas
  department_a: type = group
                priority = 80
                include_users = susann holly fran
v Example 2
  In this example, the group called great_lakes has five members and these
  users’ jobs have a priority of 100:
  # Define group stanzas
  great_lakes: type = group
               priority = 100
               include_users = huron ontario michigan erie superior

Defining clusters

The cluster stanza defines the LoadLeveler multicluster environment.

Any cluster that wants to participate in the multicluster must have cluster
stanzas defined for all clusters with which the local cluster interacts. If
you have a cluster stanza defined, LoadLeveler is configured to be in the
multicluster environment.

Cluster stanza format and keyword summary

Cluster stanzas are optional.

Cluster stanzas take the following format. Default values for keywords appear
in bold. The cluster stanza label must define a unique cluster name within the
multicluster environment.

label: type = cluster
|      allow_scale_across_jobs = true | false
       exclude_classes = class_name[(cluster_name)] ...
       exclude_groups = group_name[(cluster_name)] ...
       exclude_users = user_name[(cluster_name)] ...
       inbound_hosts = hostname[(cluster_name)] ...
       inbound_schedd_port = port_number
       include_classes = class_name[(cluster_name)] ...
       include_groups = group_name[(cluster_name)] ...
       include_users = user_name[(clustername)] ...
       local = true | false
|      main_scale_across_cluster = true | false
       multicluster_security = SSL
       outbound_hosts = hostname[(cluster_name)] ...
       secure_schedd_port = port_number
       ssl_cipher_list = cipher_list

Figure 17. Format of a cluster stanza

Examples: Cluster stanzas

Any of the following cluster stanzas may apply to your situation.
[Figure 18 is a diagram of three interconnected clusters: cluster1 (machines
M1 and M2, with SCHEDD_STREAM_PORT = 1966), cluster2 (machines M3, M4, and
M5), and cluster3 (machines M6 and M7).]

Figure 18. Multicluster Example

Figure 18 shows a simple multicluster with three clusters defined as members.
Cluster1 has defined an alternate port number for the Schedds running in its
cluster by setting SCHEDD_STREAM_PORT = 1966. All of the other clusters need
to define what port to use when connecting to the inbound Schedds of cluster1
by specifying the inbound_schedd_port = 1966 keyword in the cluster1 stanza.
Cluster2 has a single machine connected to cluster1 and 2 machines connected
to cluster3. Cluster3 has a single machine connected to both cluster2 and
cluster1. Each cluster would set the local keyword to true for its own cluster
stanza in the cluster’s administration file.

# Multicluster with 3 clusters defined as members
cluster1: type=cluster
          outbound_hosts = M2(cluster2) M1(cluster3)
          inbound_hosts = M2(cluster2) M1(cluster3)
          inbound_schedd_port = 1966
cluster2: type=cluster
          outbound_hosts = M3(cluster1) M4(cluster3)
          inbound_hosts = M3(cluster1) M4(cluster3) M5(cluster3)
cluster3: type=cluster
          outbound_hosts = M6
          inbound_hosts = M6
Chapter 6. Performing additional administrator tasks

There are additional ways to modify the LoadLeveler environment that either
require an administrator to customize the configuration and administration
files, or require the use of LoadLeveler commands or APIs.

Table 22 lists these additional administrator tasks.

Table 22. Roadmap of additional administrator tasks

To learn about:                      Read the following:
Setting up the environment for       “Setting up the environment for parallel
parallel jobs                        jobs” on page 104
Configuring and using an             v “Using the BACKFILL scheduler” on page 110
alternative scheduler                v “Using an external scheduler” on page 115
                                     v “Example: Changing scheduler types” on
                                       page 126
| Using additional features          v “Preempting and resuming jobs” on page 126
| available with the BACKFILL        v “Configuring LoadLeveler to support
| scheduler                            reservations” on page 131
|                                    v “Working with reservations” on page 213
|                                    v “Data staging” on page 113
Working with AIX’s workload          “Steps for integrating LoadLeveler with the
balancing component                  AIX Workload Manager” on page 137
Enabling LoadLeveler’s               “LoadLeveler support for checkpointing
checkpoint/restart function          jobs” on page 139
Enabling LoadLeveler’s affinity      v LoadLeveler scheduling affinity (see
scheduling support                     “LoadLeveler scheduling affinity
                                       support” on page 146)
Enabling LoadLeveler’s               v “LoadLeveler multicluster support” on
multicluster support                   page 148
                                     v “Configuring a LoadLeveler multicluster”
                                       on page 150
|                                    v “Scale-across scheduling with
|                                      multiclusters” on page 153
Enabling LoadLeveler’s Blue Gene     v “LoadLeveler Blue Gene support” on page 155
support                              v “Configuring LoadLeveler Blue Gene
                                       support” on page 157
Enabling LoadLeveler’s fair share    v “Fair share scheduling overview” on page 27
scheduling support                   v “Using fair share scheduling” on page 160
Moving job records from a down       v “Procedure for recovering a job spool” on
Schedd to another Schedd within        page 167
the local cluster                    v “llmovespool - Move job records” on page 472
Correctly specifying configuration   v Chapter 12, “Configuration file reference,”
and administration file keywords       on page 263
                                     v Chapter 13, “Administration file
                                       reference,” on page 321
Managing LoadLeveler operations
Table 22. Roadmap of additional administrator tasks (continued)

To learn about:                      Read the following:
v Querying status                    v “llclass - Query class information” on
                                       page 433
                                     v “llq - Query job status” on page 479
                                     v “llqres - Query a reservation” on page 500
                                     v “llstatus - Query machine status” on
                                       page 512
v Changing attributes of submitted   v “llfavorjob - Reorder system queue by
  jobs                                 job” on page 447
                                     v “llfavoruser - Reorder system queue by
                                       user” on page 449
                                     v “llmodify - Change attributes of a
                                       submitted job step” on page 464
                                     v “llprio - Change the user priority of
                                       submitted job steps” on page 477
v Changing the state of submitted    v “llcancel - Cancel a submitted job” on
  jobs                                 page 421
                                     v “llhold - Hold or release a submitted
                                       job” on page 454

Setting up the environment for parallel jobs

Additional administration tasks apply to parallel jobs.

This topic describes the following administration tasks that apply to
parallel jobs:
v Scheduling support
v Reducing job launch overhead
v Submitting interactive POE jobs
v Setting up a class
v Setting up a parallel master node
v Configuring MPICH jobs
v Configuring MVAPICH jobs
v Configuring MPICH-GM jobs
For information on submitting parallel jobs, see “Working with parallel jobs”
on page 194.

Scheduling considerations for parallel jobs

| For parallel jobs, LoadLeveler supports BACKFILL scheduling for efficient
| use of system resources. This scheduler runs both serial and parallel jobs.

BACKFILL scheduling also supports:
v Multiple tasks per node
v Multiple user space tasks per adapter
v Preemption

Specify the LoadLeveler scheduler using the SCHEDULER_TYPE keyword. For more
information on this keyword and supported scheduler types, see “Choosing a
scheduler” on page 44.
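For example, selecting the BACKFILL scheduler takes a single entry in the
configuration file (this is the same setting used in the interactive POE setup
steps later in this chapter):
SCHEDULER_TYPE = BACKFILL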
Steps for reducing job launch overhead for parallel jobs

Administrators may define a number of LoadLeveler starter processes to be
ready and waiting to handle job requests. Having this pool of ready processes
reduces the amount of time LoadLeveler needs to prepare jobs to run.

You also may control how environment variables are copied for a job. Reducing
the number of environment variables that LoadLeveler has to copy reduces the
amount of time LoadLeveler needs to prepare jobs to run.

Before you begin: You need to know:
v How many jobs might be starting at the same time. This estimate determines
  how many starter processes to have LoadLeveler start in advance, to be
  ready and waiting for job requests.
v The type of parallel jobs that typically are used. If IBM Parallel
  Environment (PE) is used for parallel jobs, PE copies the user’s
  environment to all executing nodes. In this case, you may configure
  LoadLeveler to avoid redundantly copying the same environment variables.
v How to correctly specify configuration keywords. For details about specific
  keyword syntax and use:
  – In the administration file, see Chapter 13, “Administration file
    reference,” on page 321.
  – In the configuration file, see Chapter 12, “Configuration file
    reference,” on page 263.

Perform the following steps to configure LoadLeveler to reduce job launch
overhead for parallel jobs.
1. In the local or global configuration file, specify the number of starter
   processes for LoadLeveler to automatically start before job requests are
   submitted. Use the PRESTARTED_STARTERS keyword to set this value.
   Tip: The default value of 1 should be sufficient for most installations.
2. If typical parallel jobs use a facility such as Parallel Environment,
   which copies user environment variables to all executing nodes, set the
   env_copy keyword in the class, user, or group stanzas to specify that
   LoadLeveler only copy user environment variables to the master node by
   default.
   Rules:
   v Users also may set this keyword in the job command file. If the env_copy
     keyword is set in the job command file, that setting overrides any
     setting in the administration file. For more information, see “Step for
     controlling whether LoadLeveler copies environment variables to all
     executing nodes” on page 195.
   v If the env_copy keyword is set in more than one stanza in the
     administration file, LoadLeveler determines the setting to use by
     examining all values set in the applicable stanzas. See the table in the
     env_copy administration file keyword to determine what value LoadLeveler
     will use.
3. Notify LoadLeveler daemons by issuing the llctl command with either the
   reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
   modifications you made to the configuration and administration files.

When you are done with this procedure, you can use the POE stderr and
LoadLeveler logs to trace actions during job launch.
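Steps 1 and 2 might translate into entries like the following sketch. The
starter count and class name are illustrative only:
# Global or local configuration file (step 1)
PRESTARTED_STARTERS = 2       # keep two starter processes waiting

# Administration file (step 2): copy the environment only to the
# master node for this class; PE propagates it to the other nodes.
poe_jobs: type = class
          env_copy = master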
Steps for allowing users to submit interactive POE jobs

You can set up your system so that users can submit interactive POE jobs to
LoadLeveler.

Perform the following steps to set up your system so that users can submit
interactive POE jobs to LoadLeveler.
1. Make sure that you have installed LoadLeveler and defined LoadLeveler
   administrators. See “Defining LoadLeveler administrators” on page 43 for
   information on defining LoadLeveler administrators.
2. If running user space jobs, LoadLeveler must be configured to use switch
   adapters. A way to do this is to run the llextRPD command to extract node
   and adapter information from the RSCT peer domain. See “llextRPD - Extract
   data from an RSCT peer domain” on page 443 for additional information.
3. In the configuration file, define your scheduler to be the LoadLeveler
   BACKFILL scheduler by specifying SCHEDULER_TYPE = BACKFILL. See “Choosing
   a scheduler” on page 44 for more information.
4. In the administration file, specify batch, interactive, or general use for
   nodes. You can use the machine_mode keyword in the machine stanza to
   specify the type of jobs that can run on a node; you must specify either
   interactive or general if you are going to run interactive jobs.
5. In the administration file, configure optional functions, including:
   v Setting up pools: you can organize nodes into pools by using the
     pool_list keyword in the machine stanza. See “Defining machines” on page
     84 for more information.
   v Enabling SP™ exclusive use accounting: you can specify that the
     accounting function on an SP system be informed that a job step has
     exclusive use of a machine by specifying spacct_exclusive_enable = true
     in the machine stanza. See “Defining machines” on page 84 for more
     information on these keywords.
6. Consider setting up a class stanza for your interactive POE jobs. See
   “Setting up a class for parallel jobs” for more information. Define this
   class to be your default class for interactive jobs by specifying this
   class name on the default_interactive_class keyword. See “Defining users”
   on page 97 for more information.

Setting up a class for parallel jobs

To define the characteristics of parallel jobs run by your installation, you
should set up a class stanza in the administration file and define a class
(in the Class statement in the configuration file) for each task you want to
run on a node.

Suppose your installation plans to submit long-running parallel jobs, and you
want to define the following characteristics:
v Only certain users can submit these jobs
v Jobs have a 30-hour run time limit
v A job can request a maximum of 60 nodes and 120 total tasks
v Jobs will have a relatively low run priority
The following is a sample class stanza for long-running parallel jobs which
takes into account these characteristics:
long_parallel: type=class
               wall_clock_limit = 108000
               include_users = jack queen king ace
               priority = 50
               total_tasks = 120
               max_node = 60
               maxjobs = 2

Note the following about this class stanza:
v The wall_clock_limit keyword sets a wall clock limit of 108000 seconds (30
  hours) for jobs in this class
v The include_users keyword allows four users to submit jobs in this class
v The priority keyword sets a relative priority of 50 for jobs in this class
v The total_tasks keyword specifies that a user can request up to 120 total
  tasks for a job in this class
v The max_node keyword specifies that a user can request up to 60 nodes for a
  job in this class
v The maxjobs keyword specifies that a maximum of two jobs in this class can
  run simultaneously

Suppose users need to submit job command files containing the following
statements:
node = 30
tasks_per_node = 4
In your LoadL_config file, you must code the Class statement such that at
least 30 nodes have four or more long_parallel classes defined. That is, the
configuration file for each of these nodes must include the following
statement:
Class = { "long_parallel" "long_parallel" "long_parallel" "long_parallel" }
or
Class = long_parallel(4)
For more information, see “Defining LoadLeveler machine characteristics” on
page 54.

| Striping when some networks fail

| When multiple networks are configured in a cluster, a job can request
| striping over the networks by setting sn_all in the network statement in the
| job command file. The striping_with_minimum_networks administration file
| keyword in the class stanza is used to tell LoadLeveler how to select nodes
| for sn_all jobs of a specific class when one or more networks are
| unavailable. When striping_with_minimum_networks is set to false for a
| class, LoadLeveler will only select nodes for sn_all jobs of that class
| where all the networks are up and in the READY state. When
| striping_with_minimum_networks is set to true, LoadLeveler will select a set
| of nodes where at least more than half of the networks on the nodes are up
| and in the READY state.

| For example, if there are 8 networks connected to a node and
| striping_with_minimum_networks is set to false, all 8 networks would have to
| be up and in the READY state to consider that node for sn_all jobs. If
| striping_with_minimum_networks is set to true, nodes with at least 5
| networks up and in the READY state would be considered for sn_all jobs.
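In administration file terms, the eight-network example corresponds to a
class stanza along the following lines (the class name is invented):
sn_all_jobs: type = class
             # false: nodes qualify for sn_all jobs of this class only
             #        if all 8 networks are up and READY
             # true:  nodes qualify if more than half (5 or more) of
             #        the 8 networks are up and READY
             striping_with_minimum_networks = true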
Setting up a parallel master node

LoadLeveler allows you to define a parallel master node that LoadLeveler will
use as the first node for a job submitted to a particular class.

To set up a parallel master node, code the following keywords in the node’s
class and machine stanzas in the administration file:
# MACHINE STANZA: (optional)
mach1: type = machine
       master_node_exclusive = true
# CLASS STANZA: (optional)
pmv3: type = class
      master_node_requirement = true
Specifying master_node_requirement = true forces all parallel jobs in this
class to use, as their first node, a machine with the master_node_exclusive =
true setting. For more information on these keywords, see “Defining machines”
on page 84 and “Defining classes” on page 89.

Configuring LoadLeveler to support MPICH jobs

The MPICH package can be configured so that LoadLeveler will be used to spawn
all tasks in an MPICH application.

Using LoadLeveler to spawn MPICH tasks allows LoadLeveler to accumulate
accounting data for the tasks and also allows LoadLeveler to ensure that all
tasks are terminated when the job completes.

For LoadLeveler to spawn the tasks of an MPICH job, the MPICH package must be
configured to use the LoadLeveler llspawn.stdio command when starting tasks.
To configure MPICH to use llspawn.stdio, set the environment variable
RSHCOMMAND to the location of the llspawn.stdio command and run the configure
command for the MPICH package. On Linux systems, enter the following:
# export RSHCOMMAND=/opt/ibmll/LoadL/full/bin/llspawn.stdio
# ./configure
Note: This configuration works on MPICH-1.2.7. Additional documentation for
MPICH is available from the Argonne National Laboratory web site at
http://www-unix.mcs.anl.gov/mpi/mpich1/.

Configuring LoadLeveler to support MVAPICH jobs

To run MVAPICH jobs under LoadLeveler control, you must specify the llspawn
command to replace the default RSHCOMMAND value during software configuration.

The compiled MVAPICH implementation code uses the llspawn command to start
tasks under LoadLeveler control. This allows LoadLeveler to have total control
over the remote tasks for accounting and cleanup.

To configure the MVAPICH code to use the llspawn command as RSHCOMMAND, change
the mpirun_rsh.c program source code by following these steps before compiling
MVAPICH:
1. Replace:
   void child_handler(int);
   with:
   void child_handler(int);
   void term_handler(int);
2. For Linux, replace:
   #define RSH_CMD "/usr/bin/rsh"
   #define SSH_CMD "/usr/bin/ssh"
   with:
   #define RSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
   #define SSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
3. Replace:
   signal(SIGCHLD, child_handler);
   with:
   signal(SIGCHLD, SIG_IGN);
   signal(SIGTERM, term_handler);
4. Add the definition for the term_handler function at the end:
   void term_handler(int signal)
   {
     exit(0);
   }

Configuring LoadLeveler to support MPICH-GM jobs

To run MPICH-GM jobs under LoadLeveler control, you need to configure the
MPICH-GM implementation you are using by specifying the llspawn command as
RSHCOMMAND.

The compiled MPICH-GM implementation code uses the llspawn command to start
tasks under LoadLeveler control. This allows LoadLeveler to have total control
over the remote tasks for accounting and cleanup.

To configure the MPICH-GM code to use the llspawn command as RSHCOMMAND,
change the mpich.make.gcc script before compiling the MPICH-GM:
Replace:
setenv RSHCOMMAND /usr/bin/rsh
with:
setenv RSHCOMMAND /opt/ibmll/LoadL/full/bin/llspawn

LoadLeveler does not manage the GM ports on the Myrinet switch. For
LoadLeveler to keep track of the GM ports they must be identified as
LoadLeveler consumable resources.

Perform the following steps to use consumable resources to manage GM ports:
1. Pick a name for the GM port resource.
   Example: As an example, this procedure assumes the name is gmports, but
   you may use another name.
   Tip: Users who submit MPICH-GM jobs need to know the name that you define
   for the GM port resource.
2. In the LoadLeveler configuration file, specify the GM port resource name
   on the SCHEDULE_BY_RESOURCES keyword.
   Example:
   SCHEDULE_BY_RESOURCES = gmports
   Tip: If the SCHEDULE_BY_RESOURCES keyword already is specified in the
   configuration file, you can just add the GM port resource name to other
   values already listed.
3. In the administration file, specify how many GM ports are available on
   each machine. Use the resources keyword to specify the GM port resource
   name and the number of GM ports.
   Example:
   resources=gmports(n)
   Tips:
   v The resources keyword also must appear in the job command file for an
     MPICH-GM job.
     Example: resources=gmports(1)
   v To determine the value of n, use either the number specified in the GM
     documentation or the number of GM ports you have successfully used.
     Certain system configurations may not support all available GM ports, so
     you might need to specify a lower value for the gmports resource than
     what is actually available.
4. Issue the llctl command with either the reconfig or recycle keyword.
   Otherwise, LoadLeveler will not process the modifications you made to the
   configuration and administration files.
For information about submitting MPICH-GM jobs, see “Running MPICH, MVAPICH,
and MPICH-GM jobs” on page 204.
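Putting the four steps together, and assuming the resource name gmports with
eight usable ports per machine (both the host name and the port count below
are illustrative), the relevant entries might look like this:
# Configuration file (step 2)
SCHEDULE_BY_RESOURCES = gmports

# Administration file, machine stanza (step 3)
node01: type = machine
        resources = gmports(8)   # count is illustrative; see the GM
                                 # documentation for the real value

# Job command file (user side)
# @ resources = gmports(1)
Step 4, llctl with reconfig or recycle, then makes the changes take effect.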
Using the BACKFILL scheduler
The BACKFILL scheduling algorithm in LoadLeveler is designed to maximize the use of resources to achieve the highest system efficiency, while preventing potentially excessive delays in starting jobs with large resource requirements. These large jobs can run because the BACKFILL scheduler does not allow jobs with smaller resource requirements to continuously use up resources before the larger jobs can accumulate enough resources to run. While BACKFILL can be used for both serial and parallel jobs, the potential advantage is greater with parallel jobs.
Job steps are arranged in a queue based on their SYSPRIO order as they arrive from the Schedd nodes in the cluster. The queue can be reordered periodically, depending on the value of the RECALCULATE_SYSPRIO_INTERVAL keyword. In each dispatching cycle, as determined by the NEGOTIATOR_INTERVAL and NEGOTIATOR_CYCLE_DELAY configuration keywords, the BACKFILL algorithm examines these job steps sequentially in an attempt to find available resources to run each job step, then dispatches those steps to run.
Once the BACKFILL algorithm encounters a job step for which it cannot immediately find enough resources, that job step becomes known as a "top dog". The BACKFILL algorithm can allocate multiple top dogs in the same dispatch cycle. By using the MAX_TOP_DOGS configuration keyword (for more information, see Chapter 12, “Configuration file reference,” on page 263), you can define the maximum number of top dogs that the central manager will allocate.
For each top dog, the BACKFILL algorithm will attempt to calculate the earliest time at which enough resources will become free to run the corresponding top dog. This is based on the assumption that each currently running job step will run until its hard wall clock limit is reached and that when a job step terminates, the resources which that step has been using will become available. The time at which enough currently running job steps will have terminated, meaning enough resources have become available to run a top dog, is called the top dog’s future start time. The future start time of each top dog is effectively guaranteed for the remainder of the execution of the BACKFILL algorithm. The resources that each top dog will use at its corresponding start time and for its duration, as specified by its hard wall clock limit, are reserved (not to be confused with the reservation feature available in LoadLeveler).
Note: A job that is bound to a reservation is not considered for top-dog scheduling, so there is no top-dog scheduling performed inside reservations.
In some cases, it may not be possible to calculate the future start time of a job step. Consider, for example, a case where there are 20 nodes in the cluster and a job step requires 24 nodes to run. Even when all nodes in the cluster are idle, it will not be possible for this job step to run. Only the addition of nodes to the cluster would allow the job step to run, and there is no way the BACKFILL algorithm can make any assumptions about when that could take place. In situations like this, the job step is not considered a "top dog", no resources are "reserved", and the BACKFILL algorithm goes on to the next job step in the queue.
The BACKFILL scheduling algorithm classifies job steps into distinct types: REGULAR, TOP DOG, and BACKFILL:
v The REGULAR job step is a job step for which enough resources are currently available and no top dogs have yet been allocated.
v The TOP DOG job step is a job step for which not enough resources are currently available, but enough resources are available at a future time and one of the following conditions is met:
  – The TOP DOG job step is not expected to run at a time when any other top dog is expected to run.
  – If the TOP DOG is expected to run at a time when some other top dogs are expected to run, then it cannot be using resources reserved by such top dogs.
v The BACKFILL job step is a job step for which enough resources are currently available and one of the following conditions is met:
  – The BACKFILL job step is expected to complete before the future start times of all top dogs, based on the hard wall clock limit of the BACKFILL job step.
  – If the BACKFILL job step is not expected to complete before the future start time of at least one top dog, then it cannot be using resources reserved by the top dogs that are expected to start before the BACKFILL job step is expected to complete.
Table 23 provides a roadmap of BACKFILL scheduler tasks.
Table 23. Roadmap of BACKFILL scheduler tasks
Subtask: Configuring the BACKFILL scheduler
Associated instructions:
v “Choosing a scheduler” on page 44
v “Tips for using the BACKFILL scheduler” on page 112
v “Example: BACKFILL scheduling” on page 113
Subtask: Using additional LoadLeveler features available under the BACKFILL scheduler
Associated instructions:
v “Preempting and resuming jobs” on page 126
v “Configuring LoadLeveler to support reservations” on page 131
v “Working with reservations” on page 213
v “Data staging” on page 113
v “Scale-across scheduling with multiclusters” on page 153
Subtask: Use the BACKFILL scheduler to dispatch and manage jobs
Associated instructions:
v “llclass - Query class information” on page 433
v “llmodify - Change attributes of a submitted job step” on page 464
v “llpreempt - Preempt a submitted job step” on page 474
v “llq - Query job status” on page 479
v “llsubmit - Submit a job” on page 531
v “Data access API” on page 560
v “Error handling API” on page 639
v “ll_modify subroutine” on page 677
v “ll_preempt subroutine” on page 686

Tips for using the BACKFILL scheduler
Note the following essential considerations when using the BACKFILL scheduler:
v To use this scheduler, either users must set a wall-clock limit in their job command file or the administrator must define a wall-clock limit value for the class to which a job is assigned. Jobs with a wall_clock_limit of unlimited cannot be used to backfill because they may not finish in time.
v Using wall clock limits that accurately reflect the actual running time of the job steps will result in more efficient utilization of resources. When a job step’s wall clock limit is substantially longer than the amount of time the job step actually needs, it results in two inefficiencies in the BACKFILL algorithm:
  – The future start time of a "top dog" will be calculated to be much later due to the long wall clock limits of the running job steps, leaving a larger window for BACKFILL job steps to run. This causes the "top dog" to start later than it would have if more accurate wall clock limits had been given.
  – A job step is less likely to be backfilled if its wall clock limit is longer, because it is more likely to run past the future start time of a "top dog".
v You should use only the default settings for the START expression and the other job control functions described in “Managing job status through control expressions” on page 68. If you do not use these default settings, jobs will still run but the scheduler will not be as efficient. For example, the scheduler will not be able to guarantee a time at which the highest priority job will run.
v You should configure any multiprocessor (SMP) nodes such that the number of jobs that can run on a node (determined by the MAX_STARTERS keyword) is always less than or equal to the number of processors on the node.
v Due to the characteristics of the BACKFILL algorithm, in some cases this scheduler may not honor the MACHPRIO statement. For more information on MACHPRIO, see “Setting negotiator characteristics and policies” on page 45.
v When using PREEMPT_CLASS rules, it is helpful to create a SYSPRIO expression that is consistent with the preemption rules. This can be done by using the ClassSysprio built-in variable with a multiplier, such as SYSPRIO: (ClassSysprio * 10000) - QDate. If classes that appear on the left-hand side of PREEMPT_CLASS rules are given a higher priority than those that appear on the right, preemption will not be required as often, because the job steps that can preempt will be higher in the queue than the job steps that can be preempted.
v Issuing llq -s against a top-dog step will show that the step is a top dog.

Example: BACKFILL scheduling
On a rack with 10 nodes, 8 of the nodes are being used by Job A. Job B has the highest priority in the queue and requires 10 nodes. Job C has the next highest priority in the queue and requires only two nodes. Job B has to wait for Job A to finish so that it can use the freed nodes. Because Job A is using only 8 of the 10 nodes, the BACKFILL scheduler can schedule Job C (which needs only the two available nodes) to run, as long as it finishes before Job A finishes (and Job B starts). To determine whether Job C has time to run, the BACKFILL scheduler uses Job C’s wall_clock_limit value to determine whether it will finish before Job A ends. If Job C has a wall_clock_limit of unlimited, it may not finish before Job B’s start time, and it will not be dispatched.
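For instance, a job command file fragment like the following gives the scheduler the finite wall clock limit it needs to consider a step such as Job C for backfill. The class name and node count are illustrative only:
# @ job_type = parallel
# @ class = short
# @ node = 2
# @ wall_clock_limit = 00:30:00
# @ queue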
Data staging
Data staging allows you to stage data needed by a job before the job begins execution and to move data back to archives when a job has finished execution. A job can use one inbound data staging step and one outbound data staging step. The inbound step is the first to be executed and the outbound step, the last.
LoadLeveler provides data staging for two scenarios:
1. A single replica of the data files needed by a job has to be created on a common file system.
2. A replica of the data files has to be created on every machine on which the job will run.
LoadLeveler also allows you to request the time at which data staging operations should be scheduled:
1. A single replica must be created as soon as a job is submitted, regardless of when the job will be executed. This is the AT_SUBMIT configuration option.
2. A single replica of the data files must be created as close as possible to the execution time of the job. This is the JUST_IN_TIME configuration option.
3. A replica must be created on each machine that the job runs on, as close as possible to the execution time of the job. This is also the JUST_IN_TIME configuration option.
The basic steps involved in data staging include:
1. A job is submitted that contains data staging keywords.
2. LoadLeveler generates inbound and outbound data staging steps in accordance with these keywords. All other steps of the job have an implicit dependency on the completion of the inbound data staging step.
3. Scheduling methods:
   a. With the AT_SUBMIT configuration option, the data staging step is started first and the application steps are scheduled when the data staging dependency is satisfied (that is, when the inbound data staging step is completed).
   b. With the JUST_IN_TIME configuration option, the first application step of the job is scheduled in the future based on the wall clock time specified for the inbound data staging step. The inbound data staging step is started on the machines that will be used by the first application step.
4. When the inbound data staging step completes, all of the application job steps become eligible for scheduling. The exit code from the inbound data staging program is made available to all application job steps in the LL_DSTG_IN_EXIT_CODE environment variable.
5. When all the application job steps are completed, the outbound data staging step is started by LoadLeveler. Typically, the outbound data staging step would be used to move data files back to their archives.
Note: You cannot preempt data staging steps using the llpreempt command or by specifying the data_stage class in system preemption rules. Similarly, a step belonging to the data_stage class cannot preempt any other job step.

Configuring LoadLeveler to support data staging
LoadLeveler allows you to specify the execution time for data staging job steps using the DSTG_TIME keyword, which defaults to the AT_SUBMIT value. To schedule data staging operations as close to the application as possible, use the JUST_IN_TIME value. The DSTG_MIN_SCHEDULING_INTERVAL keyword can be used to optimize scheduler performance by allowing data staging jobs to be scheduled only at specific intervals.
A special set of data staging step initiators, called DSTG_MAX_STARTERS, can be set up for data staging job steps. These initiators are a distinct set of resources on the compute node, not included in the MAX_STARTERS set up for compute jobs. You cannot specify the built-in data_stage class in:
v The CLASS keyword of a job command file
v The default_class keyword in the administration file
For more information about the data staging keywords, see “Configuration file keyword descriptions” on page 265.
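As an illustration, a configuration file fragment such as the following enables just-in-time data staging, reserves two initiators per node for data staging steps, and limits how often data staging steps are scheduled. The values are examples only, and the interval is assumed here to be expressed in seconds; check the keyword descriptions for the valid ranges and units:
DSTG_TIME = JUST_IN_TIME
DSTG_MAX_STARTERS = 2
DSTG_MIN_SCHEDULING_INTERVAL = 900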
The LoadLeveler administration file class stanza keywords can be used to specify defaults, limits, and restrictions for the built-in data_stage class. The data_stage class cannot be specified as the default class for a user, and you cannot specify the data_stage class in your job command file; steps of this class are generated automatically by LoadLeveler based on the data staging keywords used in job command files.
The built-in data_stage class can be configured in the administration file using a class stanza, just as you would do for any other class. Some examples of how you might use a stanza for the data_stage class are:
v Including and excluding users and groups from this class to control which users are permitted to use data staging.
v Specifying defaults for resource limits such as cpu_limit or nofile_limit for data staging steps.
v Specifying defaults and maximum allowed values for the dstg_resources job command file keyword using default_resources and max_resources.
v Limiting the total number of data staging jobs or tasks in the cluster at any one time using maxjobs or max_total_tasks.
For more information about the data staging keywords, see “Administration file keyword descriptions” on page 327.
If an inbound data staging job step is soft-bound to a reservation and the dstg_node=any keyword is set, it can be started ahead of the reservation start time, if data staging resources are available. In all other cases, data staging steps will run within the reservation itself.

Using an external scheduler
The LoadLeveler API provides interfaces that allow an external scheduler to manage the assignment of resources to jobs and the dispatching of those jobs.
The primary interfaces for the tasks of an external scheduler are:
v ll_query to obtain information about the LoadLeveler cluster, the machines of the cluster, jobs, and AIX Workload Manager.
v ll_get_data to obtain information about specific objects such as jobs, machines, and adapters.
v ll_start_job_ext to start a LoadLeveler job.
  – The ll_start_job_ext subroutine supports both serial and parallel jobs. For parallel jobs, ll_start_job_ext provides the ability to specify which adapters are used by the communication protocols of each job task. This assures that each task uses the same network for communication over a given protocol.
The steps for dispatching jobs with an external scheduler are:
1. Gather information about the LoadLeveler cluster (ll_query(CLUSTER)).
2. Gather information about the machines in the LoadLeveler cluster (ll_query(MACHINES)).
3. Gather information about the jobs in the cluster (ll_query(JOBS)).
4. Determine the resources that are currently free. (See the note that follows.)
5. Determine which jobs to start. Assign resources to the jobs to be started and dispatch them (ll_start_job_ext(LL_start_job_info_ext*)).
6. Repeat steps 1 through 5.
When an external scheduler is used, the LoadLeveler negotiator does not keep track of the resources used by jobs started by the external scheduler. There are two ways that an external scheduler can keep track of the free resources available for starting new jobs. The method that should be used depends on whether the external scheduler runs continuously while all scheduling is occurring or is executed to start a finite number of jobs and then terminates:
v If the external scheduler runs continuously, it should query the total resources available in the LoadLeveler system with ll_query and ll_get_data. It can then keep track of the resources assigned to the jobs it starts while they are running and return the resources to the available pool when the jobs complete.
v If the external scheduler is executed to start a finite number of jobs and then terminates, it must determine the pool of available resources when it first starts. It can do this by first querying the total resources in the LoadLeveler system using ll_query and ll_get_data. Then it would query the jobs in the system
(again using ll_query), looking for jobs that are running. For each running job, it would remove the resources used by the job from the available pool. After all the running jobs are processed, the available pool would indicate the amount of free resources for starting new jobs.
To find out more about dispatching jobs with an external scheduler, use the information in Table 24.
Table 24. Roadmap of tasks for using an external scheduler
Subtask: Learn about the LoadLeveler functions that are limited or not available when you use an external scheduler
Associated instructions: “Replacing the default LoadLeveler scheduling algorithm with an external scheduler”
Subtask: Prepare the LoadLeveler environment for using an external scheduler
Associated instructions: “Customizing the configuration file to define an external scheduler” on page 118
Subtask: Use an external scheduler to dispatch jobs
Associated instructions:
v “Steps for getting information about the LoadLeveler cluster, its machines, and jobs” on page 118
v “Assigning resources and dispatching jobs” on page 122

Replacing the default LoadLeveler scheduling algorithm with an external scheduler
It is important to know how LoadLeveler keywords and commands behave when you replace the default LoadLeveler scheduling algorithm with an external scheduler. LoadLeveler scheduling keywords and commands fall into the following categories:
v Keywords not involved in scheduling decisions are unchanged.
v Keywords kept in the job object or in the machine that are used by the default LoadLeveler scheduler have their values maintained as before and passed to the data access API.
v Keywords used only by the default LoadLeveler scheduler have no effect.
Table 25 discusses specific keywords and commands and how they behave when you disable the default LoadLeveler scheduling algorithm.
Table 25. Effect of LoadLeveler keywords under an external scheduler
Job command file keywords:
class - This value is provided by the data access API. Machines chosen by ll_start_job_ext must have the class of the job available or the request will be rejected.
dependency - Supported as before. Job objects for which the dependency cannot be evaluated (because a previous step has not run) are maintained in the NotQueued state, and attempts to start them using ll_start_job_ext will result in an error. If the dependency is met, ll_start_job_ext can start the step.
hold - ll_start_job_ext cannot start a job that is in Hold status.
preferences - Passed to the data access API.
requirements - ll_start_job_ext returns an error if the specified machines do not match the requirements of the job. This includes Disk and Virtual Memory requirements.
startdate - The job remains in the Deferred state until the startdate specified in the job is reached. ll_start_job_ext cannot start a job in the Deferred state.
user_priority - Used in calculating the system priority (as described in “Setting and changing the priority of a job” on page 230). The system priority assigned to the job is available through the data access API. No other control of the order in which jobs are run is enforced.
Administration file keywords:
master_node_exclusive - Ignored
master_node_requirement - Ignored
max_jobs_scheduled - Ignored
max_reservations - Ignored
max_reservation_duration - Ignored
max_total_tasks - Ignored
maxidle - Supported
maxjobs - Ignored
maxqueued - Supported
priority - Used to calculate the system priority (where appropriate).
speed - Available through the data access API.
Configuration file keywords:
MACHPRIO - Calculated but not used.
MAX_STARTERS - Calculated, and if starting the job causes this value to be exceeded, ll_start_job_ext returns an error.
SYSPRIO - Calculated and available to the data access API.
NEGOTIATOR_PARALLEL_DEFER - Ignored
NEGOTIATOR_PARALLEL_HOLD - Ignored
NEGOTIATOR_RESCAN_QUEUE - Ignored
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL - Works as before. Set this value to 0 if you do not want the system priorities of job objects recalculated.
Customizing the configuration file to define an external scheduler
To use an external scheduler, one of the tasks you must perform is setting the configuration file keyword SCHEDULER_TYPE to the value API. This keyword option provides a time-based (rather than an event-based) interface; that is, your application must use the data access API to poll LoadLeveler at specific times for machine and job information.
When you enable a scheduler type of API, you must specify AGGREGATE_ADAPTERS=NO to make the individual switch adapters available to the external scheduler. This means the external scheduler receives each individual adapter connected to the network, instead of having them collectively grouped together. You will see each adapter listed individually in the llstatus -l command output. When this keyword is set to YES, the llstatus -l command will show an aggregate adapter that contains information on all switch adapters on the same network. For detailed information about individual switch adapters, issue the llstatus -a command.
You also may use the PREEMPTION_SUPPORT keyword, which specifies the level of preemption support for a cluster. Preemption allows a running job step to be suspended so that another job step can run.
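Putting these keywords together, a minimal configuration file fragment for running under an external scheduler might look like the following sketch. The PREEMPTION_SUPPORT value shown assumes the external scheduler will preempt jobs; adjust it to the level of support your cluster actually needs:
SCHEDULER_TYPE = API
AGGREGATE_ADAPTERS = NO
PREEMPTION_SUPPORT = full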
Steps for getting information about the LoadLeveler cluster, its machines, and jobs
Perform the following steps to retrieve and use information about the LoadLeveler cluster, machines, jobs, and AIX Workload Manager:
1. Create a query object for the kind of information you want.
   Example: To query machine information, code the following instruction:
   LL_element * query_element = ll_query(MACHINES);
2. Customize the query to filter the specific information you want. You can filter the list of objects for which you want information. For some queries, you can also filter how much information you want.
   Example: The following lines customize the query for just the hosts node01.ibm.com and node02.ibm.com and to return the information contained in the llstatus -f command:
   char * hostlist[] = { "node01.ibm.com", "node02.ibm.com", NULL };
   ll_set_request(query_element, QUERY_HOST, hostlist, STATUS_LINE);
3. Once the query has been customized:
   a. Submit it using ll_get_objs, which returns the first object that matches the query.
   b. Interrogate the returned object using the ll_get_data subroutine to retrieve specific attributes.
   Depending on the information being queried, the query may be directed to a specific node and a specific daemon on that node.
   Example: A JOBS query for all data may be directed to the negotiator, the Schedd, or the history file. If it is directed to the Schedd, you must specify the host of the Schedd you are interested in. The following demonstrates retrieving the name of the first machine returned by the query constructed previously:
   int machine_count;
   int rc;
   LL_element * element = ll_get_objs(query_element, LL_CM, NULL, &machine_count, &rc);
   char * mname;
   ll_get_data(element, LL_MachineName, &mname);
Because there is only one negotiator in a LoadLeveler cluster, the host does not have to be specified. The third parameter is the address of an integer that will receive the count of objects returned, and the fourth parameter is the address of an integer that will receive the completion code of the call. If the call fails, NULL is returned and the location pointed to by the fourth parameter is set to a reason code. If the call succeeds, the value returned is used as the first parameter to a call to ll_get_data.
The second parameter to ll_get_data is a specification that indicates what attribute of the object is being interrogated. The third parameter to ll_get_data is the address of the location into which to store the result. ll_get_data returns zero if it is successful and nonzero if an error occurs. It is important that the specification (the second parameter to ll_get_data) be valid for the object passed in (the first parameter) and that the address passed in as the third parameter point to the correct type for the specification. Undefined, potentially dangerous behavior will occur if either of these conditions is not met.

Example: Retrieving specific information about machines
The following example demonstrates printing the name and adapter list of all machines in the LoadLeveler cluster. The example could be extended to retrieve all of the information available about the machines in the cluster, such as memory, disk space, pool list, features, supported classes, and architecture, among other things. A similar process would be used to retrieve information about the cluster overall.
int i, w, rc;
int machine_count;
LL_element * query_elem;
LL_element * machine;
LL_element * adapter;
char * machine_name;
char * adapter_name;
int * window_list;
int window_count;
/* First we need to obtain a query element which is used to pass */
/* parameters in to the machine query */
if ((query_elem = ll_query(MACHINES)) == NULL) {
   fprintf(stderr, "Unable to obtain query element\n");
   /* without the query object we will not be able to do anything */
   exit(-1);
}
/* Get information relating to machines in the LoadLeveler cluster.  */
/* QUERY_ALL: we are querying all machines                           */
/* NULL: since we are querying all machines we do not need to        */
/*       specify a filter to indicate which machines                 */
/* ALL_DATA: we want all the information available about the machine */
rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_set_request failed, rc = %d\n", rc);
   /* Without customizing the query we cannot proceed */
   exit(rc);
}
/* If successful, ll_get_objs() returns the first object that    */
/* satisfies the criteria that are set in the query element and  */
/* the parameters. In this case those criteria are:              */
/*   A machine (from the type of query object)                   */
/*   LL_CM: that the negotiator knows about                      */
/*   NULL: since there is only one negotiator we don't have to   */
/*         specify which host it is on                           */
/* The number of machines is returned in machine_count and the   */
/* return code is returned in rc                                 */
machine = ll_get_objs(query_elem, LL_CM, NULL, &machine_count, &rc);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_get_objs failed, rc = %d\n", rc);
   /* query was not successful -- we cannot proceed but we need to */
   /* release the query element                                    */
   if (ll_deallocate(query_elem) == -1) {
      fprintf(stderr, "Attempt to deallocate invalid query element\n");
   }
   exit(rc);
}
printf("Number of Machines = %d\n", machine_count);
i = 0;
while (machine != NULL) {
   printf("------------------------------------------------------\n");
   printf("Machine %d:\n", i);
   rc = ll_get_data(machine, LL_MachineName, &machine_name);
   if (0 == rc) {
      printf("Machine name = %s\n", machine_name);
   } else {
      printf("Error %d getting machine name\n", rc);
   }
   printf("Adapters\n");
   ll_get_data(machine, LL_MachineGetFirstAdapter, &adapter);
   while (adapter != NULL) {
      rc = ll_get_data(adapter, LL_AdapterName, &adapter_name);
      if (0 != rc) {
         printf("Error %d getting adapter name\n", rc);
      } else {
         /* Because the list of windows on an adapter is returned */
         /* as an array of integers, we also need to know how big */
         /* the list is. First we query the window count,         */
         /* storing the result in an integer, then we query for   */
         /* the list itself, storing the result in a pointer to   */
         /* an integer. The window list is allocated for us so    */
         /* we need to free it when we are done                   */
         printf("%s windows:", adapter_name);
         ll_get_data(adapter, LL_AdapterTotalWindowCount, &window_count);
         ll_get_data(adapter, LL_AdapterWindowList, &window_list);
         for (w = 0; w < window_count; w++) {
            printf(" %d", window_list[w]);
         }
         printf("\n");
         free(window_list);
      }
      /* After the first object has been gotten, GetNext returns */
      /* the next until the list is exhausted                    */
      ll_get_data(machine, LL_MachineGetNextAdapter, &adapter);
   }
   printf("\n");
   i++;
   machine = ll_next_obj(query_elem);
}
/* First we need to release the individual objects that were */
/* obtained by the query                                     */
if (ll_free_objs(query_elem) == -1) {
   fprintf(stderr, "Attempt to free invalid query element\n");
}
/* Then we need to release the query itself */
if (ll_deallocate(query_elem) == -1) {
   fprintf(stderr, "Attempt to deallocate invalid query element\n");
}

Example: Retrieving information about jobs
The following example demonstrates retrieving information about jobs, up to the point of starting a job:
int i, rc;
int job_count;
LL_element * query_elem;
LL_element * job;
LL_element * step;
int step_state;
/* First we need to obtain a query element which is used to pass */
/* parameters in to the jobs query */
if ((query_elem = ll_query(JOBS)) == NULL) {
   fprintf(stderr, "Unable to obtain query element\n");
   /* without the query object we will not be able to do anything */
   exit(-1);
}
/* Get information relating to Jobs in the LoadLeveler cluster. */
printf("Jobs Information ========================================\n\n");
/* QUERY_ALL: we are querying all jobs                           */
/* NULL: since we are querying all jobs we do not need to        */
/*       specify a filter to indicate which jobs                 */
/* ALL_DATA: we want all the information available about the job */
rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_set_request failed, rc = %d\n", rc);
   /* Without customizing the query we cannot proceed */
   exit(rc);
}
/* If successful, ll_get_objs() returns the first object that   */
/* satisfies the criteria that are set in the query element and */
/* the parameters. In this case those criteria are:             */
/*   A job (from the type of query object)                      */
/*   LL_CM: that the negotiator knows about                     */
/*   NULL: since there is only one negotiator we don't have to  */
/*         specify which host it is on                          */
/* The number of jobs is returned in job_count and the          */
/* return code is returned in rc                                */
job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_get_objs failed, rc = %d\n", rc);
   /* query was not successful -- we cannot proceed but we need to */
   /* release the query element                                    */
   if (ll_deallocate(query_elem) == -1) {
      fprintf(stderr, "Attempt to deallocate invalid query element\n");
   }
   exit(rc);
}
printf("Number of Jobs = %d\n", job_count);
step = NULL;
while (job != NULL) {
   /* Each job is composed of one or more steps which are started */
   /* individually. We need to check the state of the job's steps */
   ll_get_data(job, LL_JobGetFirstStep, &step);
   while (step != NULL) {
      ll_get_data(step, LL_StepState, &step_state);
      /* We are looking for steps that are in idle state. The */
      /* state is returned as an int so we cast it to         */
      /* enum StepState as declared in llapi.h                */
      if ((enum StepState)step_state == STATE_IDLE) break;
      /* Otherwise, advance to the next step of this job */
      ll_get_data(job, LL_JobGetNextStep, &step);
   }
   /* If we exit the loop with a valid step, it is the one to start */
   /* otherwise we need to keep looking                             */
   if (step != NULL) break;
   job = ll_next_obj(query_elem);
}
if (step == NULL) {
   printf("No step to start\n");
   exit(0);
}

Assigning resources and dispatching jobs
After an external scheduler selects a job step to start and identifies the machines that the job step will run on, the LoadLeveler job start API is used to tell LoadLeveler the job step to start and the resources that are to be assigned to the job step.
In “Example: Retrieving information about jobs” on page 121, we reached the point where a step to start was identified. In a real external scheduler, the decision would be reached after consideration of all the idle jobs and constructing a priority
value based on attributes such as class and submit time, all of which are accessible through ll_get_data. Next, the list of available machines would be examined to determine whether a set exists with sufficient resources to run the job. This process also involves determining the size of that set of machines using attributes of the step such as the number of nodes, instances of each node, and tasks per node.
The LoadLeveler data query API allows access to that information about each job, but the interface for starting the job does not require that the machine and adapter resources match the specifications given when the job was submitted. For example, a job could be submitted specifying node=4 but could be started by an external scheduler on a single node only. Similarly, the job could specify the LAPI protocol with network.lapi=... but be started and told to use the MPI protocol. This is not considered an error, since it is up to the scheduler to interpret (and enforce, if necessary) the specifications in the job command file.
In allocating adapter resources for a step, it is important that the order of the adapter usages be consistent with the structure of the step. In some environments a task can use multiple instances of adapter windows for a protocol. If the protocol requests striping (sn_all), an adapter window (or set of windows if instances are used) is allocated on each available network. If multiple protocols are used by the task (for example, MPI and LAPI), each protocol defines its own set of windows.
The array of adapter usages passed in to ll_start_job_ext must group the windows for all of the instances on one network for the same protocol together. If the protocol requests striping, that grouping must be immediately followed by the grouping for the next network. If the task uses multiple protocols, the set of adapter usages for the first protocol must be immediately followed by the set for the next protocol. Each task will have exactly the same pattern of adapter usage entries. Corresponding entries across all the tasks represent a communication path and must be able to communicate with each other. If the usages are for User Space communication, a network table will be loaded for each set of corresponding entries.
All of the job command file keywords for specifying job structure, such as total_tasks, tasks_per_node, node=min,max and blocking, are supported by the ll_start_job_ext interface, but users should ensure that they understand the LoadLeveler model that is created for each combination when constructing the adapter usage list for ll_start_job_ext. Jobs that are submitted with node=number and tasks_per_node result in more regular LoadLeveler models and are easier to create adapter usage lists for.
In the following example, it is assumed that the step found to be dispatched will run on one machine with two tasks, each task using one switch adapter window for MPI communication. The name of the machine to run on is contained in the variable use_machine (char *), the names of the switch adapters are contained in use_adapter_1 (char *) and use_adapter_2 (char *), and the adapter windows on those adapters in use_window_1 (int) and use_window_2 (int), respectively. Furthermore, each adapter will be allocated 1 MB of memory.
If the network adapters that the external scheduler assigns to the job allocate communication buffers in rCxt blocks instead of bytes (the Switch Network Interface for HPS is an example of such a network adapter), the api_rcxtblocks field of adapterUsage should be used to specify the number of rCxt blocks to assign instead of the mem field.
LL_start_job_info_ext *start_info;
char * pChar;
LL_element * step;
LL_element * job;
int rc;
char * submit_host;
char * step_id;
start_info = (LL_start_job_info_ext *)(malloc(sizeof(LL_start_job_info_ext)));
if (start_info == NULL) {
   fprintf(stderr, "Out of memory.\n");
   return;
}
/* Create a NULL terminated list of target machines. Each task      */
/* must have an entry in this list and the entries for tasks on the */
/* same machine must be sequential. For example, if a job is to run */
/* on two machines, A and B, and three tasks are to run on each     */
/* machine, the list would be: AAABBB                               */
/* Any specifications on the job when it was submitted such as      */
/* nodes, total_tasks or tasks_per_node must be explicitly queried  */
/* and honored by the external scheduler in order to take effect.   */
/* They are not automatically enforced by LoadLeveler when an       */
/* external scheduler is used.                                      */
/*                                                                  */
/* In this example, the job will be run on only one machine, so     */
/* the machine list consists of only 1 machine                      */
/* (plus the terminating NULL entry)                                */
start_info->nodeList = (char **)malloc(2 * sizeof(char *));
if (!start_info->nodeList) {
   fprintf(stderr, "Out of memory.\n");
   return;
}
start_info->nodeList[0] = strdup(use_machine);
start_info->nodeList[1] = NULL;
/* Retrieve information from the job to populate the start_info */
/* structure                                                    */
/* In the interest of brevity, the success of the ll_get_data() */
/* calls is not tested. In a real application it should be      */
/* The version number is set from the header that is included when  */
/* the application using the API is compiled. This allows for       */
/* checking that the application was compiled with a version of the */
/* API that is compatible with the version in the library when the  */
/* application is run.                                              */
start_info->version_num = LL_PROC_VERSION;
/* Get the first step of the job to start */
ll_get_data(job, LL_JobGetFirstStep, &step);
if (step == NULL) {
   printf("No step to start\n");
   return;
}
/* In order to set the submitting host, cluster number and proc   */
/* number in the start_info structure, we need to parse it out of */
/* the step id                                                    */
/* First get the submitting host and save it */
ll_get_data(job, LL_JobSubmitHost, &submit_host);
start_info->StepId.from_host = strdup(submit_host);
free(submit_host);
rc = ll_get_data(step, LL_StepID, &step_id);
/* The step id format is submit_host.jobno.stepno . Because the */
/* submit host is a dotted string of indeterminate length, the       */
/* simplest way to detect where the job number starts is to retrieve */
/* the submit host from the job and skip forward its length in the   */
/* step id.                                                          */
pChar = step_id + strlen(start_info->StepId.from_host) + 1;
/* The next segment is the cluster or job number */
pChar = strtok(pChar, ".");
start_info->StepId.cluster = atoi(pChar);
/* The last token is the proc or step number */
pChar = strtok(NULL, ".");
start_info->StepId.proc = atoi(pChar);
free(step_id);
/* For each protocol (eg. MPI or LAPI) on each task, we need to      */
/* specify which adapter to use, whether a window is being used      */
/* (subsystem = "US") or not (subsystem = "IP"). If a window is      */
/* used, the window ID and window buffer size must be specified.     */
/*                                                                   */
/* The adapter usage entries for the protocols of a task must be     */
/* sequential and the set of entries for tasks on the same node must */
/* be sequential. For example the twelve entries for a job where     */
/* each task uses one window for MPI and one for LAPI with three     */
/* tasks per node and running on two nodes would be laid out as:     */
/*  1: MPI window for 1st task running on 1st node                   */
/*  2: LAPI window for 1st task running on 1st node                  */
/*  3: MPI window for 2nd task running on 1st node                   */
/*  4: LAPI window for 2nd task running on 1st node                  */
/*  5: MPI window for 3rd task running on 1st node                   */
/*  6: LAPI window for 3rd task running on 1st node                  */
/*  7: MPI window for 1st task running on 2nd node                   */
/*  8: LAPI window for 1st task running on 2nd node                  */
/*  9: MPI window for 2nd task running on 2nd node                   */
/* 10: LAPI window for 2nd task running on 2nd node                  */
/* 11: MPI window for 3rd task running on 2nd node                   */
/* 12: LAPI window for 3rd task running on 2nd node                  */
/* An improperly ordered adapter usage list may cause the job not to */
/* be started or, if started, incorrect execution of the job         */
/*                                                                   */
/* This example starts the job with two tasks on one machine, using  */
/* one switch adapter window for each task. The protocol is forced   */
/* to MPI and a fixed window size of 1 MB is used. An actual         */
/* external scheduler application would check the step's             */
/* requirements and its adapter requirements with ll_get_data        */
/*                                                                   */
start_info->adapterUsageCount = 2;
start_info->adapterUsage = (LL_ADAPTER_USAGE *)malloc((start_info->adapterUsageCount) * sizeof(LL_ADAPTER_USAGE));
start_info->adapterUsage[0].dev_name = use_adapter_1;
start_info->adapterUsage[0].protocol = "MPI";
start_info->adapterUsage[0].subsystem = "US";
start_info->adapterUsage[0].wid = use_window_1;
start_info->adapterUsage[0].mem = 1048576;
start_info->adapterUsage[1].dev_name = use_adapter_2;
start_info->adapterUsage[1].protocol = "MPI";
start_info->adapterUsage[1].subsystem = "US";
start_info->adapterUsage[1].wid = use_window_2;
start_info->adapterUsage[1].mem = 1048576;
if ((rc = ll_start_job_ext(start_info)) != API_OK) {
   printf("Error %d returned attempting to start Job Step %s.%d.%d on %s\n",
          rc,
          start_info->StepId.from_host,
          start_info->StepId.cluster,
          start_info->StepId.proc,
          start_info->nodeList[0]);
} else {
   printf("ll_start_job_ext() invoked to start job step: "
          "%s.%d.%d on machine: %s.\n\n",
          start_info->StepId.from_host,
          start_info->StepId.cluster,
          start_info->StepId.proc,
          start_info->nodeList[0]);
}
free(start_info->nodeList[0]);
free(start_info);
Finally, when the step and job elements are no longer in use, ll_free_objs() and ll_deallocate() should be called on the query element.

Example: Changing scheduler types
You can toggle between the default LoadLeveler scheduler and other types of schedulers by using the SCHEDULER_TYPE keyword. Changes to SCHEDULER_TYPE do not take effect at reconfiguration; the administrator must stop and restart or recycle LoadLeveler when changing SCHEDULER_TYPE. A combination of changes to SCHEDULER_TYPE and some other keywords may terminate LoadLeveler.
The following example illustrates how you can toggle between the default LoadLeveler scheduler and an external scheduler, such as the Extensible Argonne Scheduling sYstem (EASY), developed by Argonne National Laboratory and available as public domain code.
If you are running the default LoadLeveler scheduler, perform the following steps to switch to an external scheduler:
1. In the configuration file, set SCHEDULER_TYPE = API
2. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle
If you are running an external scheduler, this is how you can re-enable the LoadLeveler scheduling algorithm:
1. In the configuration file, set SCHEDULER_TYPE = LL_DEFAULT
2. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle

Preempting and resuming jobs
The BACKFILL scheduler allows LoadLeveler jobs to be preempted so that a higher priority job step can run. Administrators may specify not only preemption rules for job classes, but also the method that LoadLeveler uses to preempt jobs. The BACKFILL scheduler supports various methods of preemption.
Use Table 26 to find more information about preemption.
Table 26. Roadmap of tasks for using preemption
Subtask: Learn about types of preemption and what it means for preempted jobs
Associated instructions: “Overview of preemption”
Subtask: Prepare the LoadLeveler environment and jobs for preemption
Associated instructions: “Planning to preempt jobs” on page 128
Subtask: Configure LoadLeveler to use preemption
Associated instructions: “Steps for configuring a scheduler to preempt jobs” on page 130

Overview of preemption
LoadLeveler supports the following two types of preemption:
v System-initiated preemption
  – Automatically enforced by LoadLeveler, except for job steps running under a reservation.
  – Governed by the PREEMPT_CLASS rules defined in the global configuration file.
  – When resources required by an incoming job are in use by other job steps, all or some of those job steps in certain classes may be preempted according to the PREEMPT_CLASS rules.
  – An automatically preempted job step will be resumed by LoadLeveler when resources become available and conditions such as START_CLASS rules are satisfied.
  – An automatically preempted job step cannot be resumed using the llpreempt command or the ll_preempt subroutine.
v User-initiated preemption
  – Manually initiated by LoadLeveler administrators using the llpreempt command or the ll_preempt subroutine.
  – A manually preempted job step cannot be resumed automatically by LoadLeveler.
  – A manually preempted job step can be resumed using the llpreempt command or the ll_preempt subroutine. Issuing this command or subroutine, however, does not guarantee that the job step will successfully be resumed. A manually preempted job step that was resumed through these interfaces competes for resources with system-preempted job steps, and will be resumed only when resources become available.
  – All steps in a set of coscheduled job steps will be preempted if one or more steps in the set are preempted.
  – A coscheduled step will not be resumed until all steps in the set of coscheduled job steps can be resumed.
For the BACKFILL scheduler only, administrators may select which method LoadLeveler uses to preempt and resume jobs. The suspend method is the default behavior, and is the preemption method LoadLeveler uses for any external schedulers that support preemption. For more information about preemption methods, see “Planning to preempt jobs” on page 128.
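As a simple illustration of system-initiated preemption, the following hypothetical rules (the class names are invented for this example; the syntax is described in “Planning to preempt jobs”) let an urgent class preempt just enough background work to run, while a START_CLASS expression keeps new background steps off nodes where an urgent job is active:
PREEMPT_CLASS[Urgent] = ENOUGH { Background }
START_CLASS[Background] = (Urgent < 1)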
For a preempted job to be resumed after system- or user-initiated preemption occurs through a method other than suspend, the restart keyword in the job command file must be set to yes. Otherwise, LoadLeveler vacates the job step and removes it from the cluster.
In order to determine the preempt type and preempt method to use when a coscheduled step preempts another step, an order of precedence for preempt types and preempt methods has been defined. All steps in the preempting coscheduled step will be examined, and the preempt type and preempt method having the highest precedence will be used. The order of precedence for preempt type is ALL, ENOUGH. The order of precedence for preempt method is remove, vacate, system hold, user hold, suspend.
When coscheduled steps are running, if one step is preempted as a result of a system-initiated preemption, then all coscheduled steps will be preempted. This implies that more resources than necessary might be preempted when one of the steps being preempted is a coscheduled step.

Planning to preempt jobs
Consider the following points when planning to use preemption:
v Avoiding circular preemption under the BACKFILL scheduler
  BACKFILL scheduling enables job preemption using rules specified with the PREEMPT_CLASS keyword. When you are setting up the preemption rules, make sure that you do not create a circular preemption path. Circular preemption causes a job class to preempt itself after applying the preemption rules recursively. For example, the following keyword definitions set up circular preemption rules on Class_A:
  PREEMPT_CLASS[Class_A] = ALL { Class_B }
  PREEMPT_CLASS[Class_B] = ALL { Class_C }
  PREEMPT_CLASS[Class_C] = ENOUGH { Class_A }
  Another example of circular preemption involves allclasses:
  PREEMPT_CLASS[Class_A] = ENOUGH {allclasses}
  PREEMPT_CLASS[Class_B] = ALL {Class_A}
  In this instance, allclasses means all classes except Class_A, so any additional preemption rule preempting Class_A causes circular preemption.
v Understanding implied START_CLASS values
  Using the "ALL" value in the PREEMPT_CLASS keyword places implied restrictions on when a job can start. For example, PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C} tells LoadLeveler two things:
  1. If a new Class_A job is about to run on a node set, then preempt all Class_B and Class_C jobs on those nodes
  2. If a Class_A job is running on a node set, then do not start any Class_B or Class_C jobs on those nodes
  This PREEMPT_CLASS statement also implies the following START_CLASS expressions:
  1. START_CLASS[Class_B] = (Class_A < 1)
  2. START_CLASS[Class_C] = (Class_A < 1)
  LoadLeveler adds all implied START_CLASS expressions to the START_CLASS expressions specified in the configuration file; the implied expressions override any existing values for START_CLASS. For example, suppose the configuration file contains the following statements:
  PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}
  START_CLASS[Class_B] = (Class_A < 5)
  START_CLASS[Class_C] = (Class_C < 3)
  When LoadLeveler runs through the configuration process, the PREEMPT_CLASS statement on the first line generates the two implied START_CLASS statements. When the implied START_CLASS statements are added in, the user-specified START_CLASS statements are overridden and the resulting START_CLASS statements are effectively equivalent to:
  START_CLASS[Class_B] = (Class_A < 1)
  START_CLASS[Class_C] = (Class_C < 3) && (Class_A < 1)
  Note: LoadLeveler’s central manager (CM) uses these effective expressions instead of the original statements specified in the configuration file. The output from llclass -l displays the original customer-specified START_CLASS expressions.
v Selecting the preemption method under the BACKFILL scheduler
  Use Table 27 and Table 28 on page 130 to determine which preemption method you want to use for jobs running under the BACKFILL scheduler. You may define one or more of the following:
  – A default preemption method to be used for all job classes, by setting the DEFAULT_PREEMPT_METHOD keyword in the configuration file.
  – A specific preemption method for one or more classes or job steps, by using an option on:
    - The PREEMPT_CLASS statement in the configuration file.
    - The llpreempt command, ll_preempt subroutine, or ll_preempt_jobs subroutine.
  Note:
  1. Process tracking must be enabled in order to use the suspend method to preempt a job. To configure LoadLeveler for process tracking, see “Tracking job processes” on page 70.
  2. For a preempted job to be resumed after system- or user-initiated preemption occurs through a method other than suspend and remove, the restart keyword in the job command file must be set to yes. Otherwise, LoadLeveler vacates the job step and removes it from the cluster.
  Table 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs
  Suspend (su) - Resumed when the preempting job completes, on the same nodes, at the point of suspension.
  Vacate (vc) - Resumed when nodes are available, on any nodes that meet the job requirements, at the beginning or at the last successful checkpoint.
  Table 28. Preemption methods for which administrator or user intervention is required
  Remove (rm) - Administrator or user must resubmit the preempted job.
  System Hold (sh) - Administrator must release the preempted job.
  User Hold (uh) - User must release the preempted job.
  For all three methods, the job resumes on any nodes that meet the job requirements, when they are available, at the beginning or at the last successful checkpoint.
v Understanding how LoadLeveler treats resources held by jobs to be preempted
  When a job step is running, it may be holding the following resources:
  – Processors
  – Scheduling slots
  – Real memory
  – ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory
  – Communication switches, if the PREEMPTION_TYPE keyword is set to FULL in the configuration file.
  When LoadLeveler suspends preemptable jobs running under the BACKFILL scheduler, certain resources held by those jobs do not become available for the preempting jobs. These resources include ConsumableVirtualMemory, ConsumableLargePageMemory, and floating resources. Under the BACKFILL scheduler only, LoadLeveler releases these resources when you select a preemption method other than suspend; for all preemption methods other than suspend, LoadLeveler treats all job-step resources as available when it preempts the job step.
v Understanding how LoadLeveler processes multiple entries for the same keyword
  If there are multiple entries for the same keyword in either a configuration file or an administration file, the last entry wins. For example, the following statements are all valid specifications for the same keyword START_CLASS:
  START_CLASS [Class_B] = (Class_A < 1)
  START_CLASS [Class_B] = (Class_B < 1)
  START_CLASS [Class_B] = (Class_C < 1)
  All three statements identify Class_B as the incoming class. LoadLeveler resolves these statements according to the "last one wins" rule, so the actual value used for the keyword is (Class_C < 1).

Steps for configuring a scheduler to preempt jobs
Before you begin:
v To define rules for starting and preempting jobs, you need to know certain details about the job characteristics and workload at your installation, including:
  – Which jobs require the same resources, or must be run on the same machines, and so on. This knowledge allows you to group specific jobs into a class.
  – Which jobs or classes have higher priority than others. This knowledge allows you to define which job classes can preempt other classes.
v To correctly configure LoadLeveler to preempt jobs, you might need to refer to the following information:
  – “Choosing a scheduler” on page 44.
  – “Planning to preempt jobs” on page 128.
  – Chapter 12, “Configuration file reference,” on page 263.
  – Chapter 13, “Administration file reference,” on page 321.
  – “llctl - Control LoadLeveler daemons” on page 439.
Perform the following steps to configure a scheduler to preempt jobs:
1. In the configuration file, use the SCHEDULER_TYPE keyword to define the type of LoadLeveler or external scheduler you want to use. Of the LoadLeveler schedulers, only the BACKFILL scheduler supports preemption.
   Rule: If you select the BACKFILL or API scheduler, you must set the PREEMPTION_SUPPORT configuration keyword to either full or no_adapter.
2. (Optional) In the configuration file, use the DEFAULT_PREEMPT_METHOD keyword to define the default method that the BACKFILL scheduler should use for preempting jobs.
   Alternative: You also may set the preemption method through the PREEMPT_CLASS keyword or on the LoadLeveler preemption command or APIs, which override the setting of the DEFAULT_PREEMPT_METHOD keyword.
3. For either the BACKFILL or API scheduler, preempting by the suspend method requires that you set the PROCESS_TRACKING configuration keyword to true.
4. In the configuration file, use the PREEMPT_CLASS and START_CLASS keywords to define the preemption and start policies for job classes.
5. In the administration file, use the max_total_tasks keyword to define the maximum number of tasks that may be run per user, group, or class.
6. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle
When you are done with this procedure, you can use the llq command to determine whether jobs are being preempted and resumed correctly. If they are not, use the LoadLeveler logs to trace the actions of each daemon involved in preemption to determine the problem.
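To pull the procedure together, a configuration file fragment like the following sketch enables preemption under the BACKFILL scheduler. The class names are invented for this example, and the vc value for DEFAULT_PREEMPT_METHOD is an assumption based on the method abbreviations shown in Table 27; verify the accepted values in Chapter 12, “Configuration file reference,” before using them:
SCHEDULER_TYPE = BACKFILL
PREEMPTION_SUPPORT = full
PROCESS_TRACKING = true
DEFAULT_PREEMPT_METHOD = vc
PREEMPT_CLASS[Urgent] = ENOUGH { Background }
START_CLASS[Background] = (Urgent < 1)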
Configuring LoadLeveler to support reservations
Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations or recurring reservations, which specify one or more time periods during which specific node resources are reserved for use by particular users or groups.
Normally, jobs wait to be dispatched until the resources they require become available. Through the use of reservations, wait time can be reduced because only jobs that are bound to the reservation may use the node resources as soon as the reservation period begins.

Reservation tasks for administrators
Use Table 29 to find additional information about reservations.
Table 29. Roadmap of reservation tasks for administrators
Subtask: Learn how reservations work in the LoadLeveler environment
Associated instructions:
v “Overview of reservations” on page 25
v “Understanding the reservation life cycle” on page 214
Subtask: Configuring a LoadLeveler cluster to support reservations
Associated instructions:
v “Steps for configuring reservations in a LoadLeveler cluster”
v “Examples: Reservation keyword combinations in the administration file” on page 134
v “Collecting accounting data for reservations” on page 63
Subtask: Working with reservations: creating reservations, submitting jobs under a reservation, and managing reservations
Associated instructions: “Working with reservations” on page 213
Subtask: Correctly coding and using administration and configuration keywords
Associated instructions:
v Chapter 13, “Administration file reference,” on page 321
v Chapter 12, “Configuration file reference,” on page 263

Steps for configuring reservations in a LoadLeveler cluster
Only the BACKFILL scheduler supports the use of reservations.
Before you begin:
v For information about configuring the BACKFILL scheduler, see “Choosing a scheduler” on page 44.
v You need to decide:
  – Which users will be allowed to create reservations.
  – How many reservations users may own, and how long a duration for their reservations will be allowed.
  – Which nodes will be used for reservations.
  – How much setup time is required before the reservation period starts.
  – Whether accounting data for reservations is to be saved.
  – The maximum lifetime for a recurring reservation before you require the user to request a new reservation for that job.
  – Additional system-wide limitations that you may want to implement, such as maintenance time blocks for specific node sets.
v For examples of possible reservation keyword combinations, see “Examples: Reservation keyword combinations in the administration file” on page 134.
v For details about specific keyword syntax and use:
  – In the administration file, see Chapter 13, “Administration file reference,” on page 321.
  – In the configuration file, see Chapter 12, “Configuration file reference,” on page 263.
Perform the following steps to configure reservations:
1. In the administration file, modify the user or group stanzas to authorize users to create reservations. You may grant the ability to create reservations to an individual user, a group of users, or a combination of users and groups. To do so, define the following keywords in the appropriate user or group stanzas:
   v max_reservations, to set the maximum number of reservations that a user or group may have.
   v (Optional) max_reservation_duration, to set the maximum amount of time for the reservation period.
   Tip: To quickly set up and use reservations, use one of the following examples:
   v To allow every user to create a reservation, add max_reservations=1 to the default user stanza. Then every administrator or user may create a reservation, as long as the number of reservations has not reached the limit for a LoadLeveler cluster.
   v To allow a specific group of users to make 10 reservations, add max_reservations=10 to the group stanza for that LoadLeveler group. Then every user in that group may create a reservation, as long as the number of reservations has not reached the limit for that group or for a LoadLeveler cluster.
   See the max_reservations description in Chapter 13, "Administration file reference," on page 321 for more information about setting this keyword in the user or group stanza.
2. In the administration file, modify the machine stanza of each machine that may be reserved. To do so, set the reservation_permitted keyword to true.
   Tip: If you want to allow every machine to be reserved, you do not have to set this keyword; by default, any LoadLeveler machine may be reserved. If you want to prevent particular machines from being reserved, however, you must define a machine stanza for that machine and set the reservation_permitted keyword to false.
3. In the global configuration file, set reservation policy by specifying values for the following keywords:
   v MAX_RESERVATIONS, to specify the maximum number of reservations per cluster.
     Note: A recurring reservation counts as only one reservation toward the MAX_RESERVATIONS limit, regardless of the number of times that the reservation recurs.
   v RESERVATION_CAN_BE_EXCEEDED, to specify whether LoadLeveler will be permitted to schedule job steps bound to a reservation when their expected end times exceed the reservation end time. The default for this keyword is TRUE, which means that LoadLeveler will schedule these bound job steps even when they are expected to continue running beyond the time at which the reservation ends. Whether these job steps run and successfully complete depends on resource availability, which is not guaranteed after the reservation ends. In addition, these job steps become subject to preemption rules after the reservation ends.
     Tip: You might want to set this keyword value to FALSE to prevent users from binding long-running jobs to run under reservations of short duration.
   v RESERVATION_MIN_ADVANCE_TIME, to define the minimum time between the time at which a reservation is created and the time at which the reservation is to start.
     Tip: To reduce the impact to the currently running workload, consider changing the default for this keyword, which allows reservations to begin as soon as they are created. You may, for example, require reservations to be made at least one day (1440 minutes) in advance, by specifying RESERVATION_MIN_ADVANCE_TIME=1440 in the global configuration file.
   v RESERVATION_PRIORITY, to define whether LoadLeveler administrators may reserve nodes on which running jobs are expected to end after the start time for the reservation.
     Tip: The default for this keyword is NONE, which means that LoadLeveler will not reserve a node on which running jobs are expected to end after the start time for the reservation. If you want to allow LoadLeveler administrators to reserve specific nodes regardless of the expected end times of job steps currently running on the node, set this keyword value to HIGH. Note, however, that setting this keyword value to HIGH might increase the number of job steps that must be preempted when LoadLeveler sets up the reservation, and many jobs might remain in Preempted state. This also applies to Blue Gene job steps. This keyword value applies only for LoadLeveler administrators; other reservation owners do not have this capability.
   v RESERVATION_SETUP_TIME, to define the amount of time LoadLeveler uses to prepare for a reservation before it is to start.
4. (Optional) In the global configuration file, set controls for the collection of accounting data for reservations:
   v To turn on accounting for reservations, add the A_RES flag to the ACCT keyword.
   v To specify a file other than the default history file to contain the data, use the RESERVATION_HISTORY keyword.
   To learn how to collect accounting data for reservations, see "Collecting accounting data for reservations" on page 63.
5. If LoadLeveler is already started, issue the command llctl -g reconfig to process the changes you made in the preceding steps.
   Tip: If you have changed the value of only the RESERVATION_PRIORITY keyword, issue the command llctl reconfig only on the central manager node.
   Result: The new keyword values take effect immediately, but they do not change the attributes of existing reservations.

When you are done with this procedure, you may perform additional tasks described in "Working with reservations" on page 213.
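To illustrate, the following sketch combines these steps. The user name, machine name, history file path, and numeric values are hypothetical, and the A_ON accounting flag shown alongside A_RES is an assumption about a typical ACCT setting; check each keyword in the reference chapters before using it. In the administration file:

   carol: type = user
       max_reservations = 4
       max_reservation_duration = 720

   node01: type = machine
       reservation_permitted = true

In the global configuration file:

   MAX_RESERVATIONS = 10
   RESERVATION_CAN_BE_EXCEEDED = FALSE
   RESERVATION_MIN_ADVANCE_TIME = 1440
   RESERVATION_PRIORITY = NONE
   RESERVATION_SETUP_TIME = 300
   ACCT = A_ON A_RES
   RESERVATION_HISTORY = /var/loadl/reservation_history

Issuing llctl -g reconfig then makes the policy active without changing the attributes of any existing reservations.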
Examples: Reservation keyword combinations in the administration file

The following examples demonstrate LoadLeveler behavior when the max_reservations and max_reservation_duration keywords are set. The examples assume that only the user and group stanzas listed exist in the LoadLeveler administration file.

v Example 1: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve

     rich: type = user
         default_group = group2

  This example shows that, by default, no one is allowed to make any reservations. No one, including LoadLeveler administrators, is permitted to make any reservations unless the max_reservations keyword is used.

v Example 2: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve

     rich: type = user
         default_group = group2
         max_reservations = 5

  This example shows how permission to make reservations can be granted to a specific user through the user stanza only. Because the max_reservations keyword is not used in any group stanza, by default, the group stanzas neither grant permissions nor put any restrictions on reservation permissions. User Rich can make reservations in any group (group2, No_Group, Group_A, and so on), whether or not the group stanzas exist in the LoadLeveler administration file. The total number of reservations user Rich can own at any given time is limited to five.

v Example 3: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2

  This example shows how permission to make reservations can be granted to a group of users through the group stanza only. Because the max_reservations keyword is not used in any user stanza, by default, the user stanzas neither grant nor deny permission to make reservations. All users in group2 (Rich, Dave, and Steve) can make reservations, but they must make reservations in group2 because other groups do not grant the permission to make reservations. The total number of reservations the users in group2 can own at any given time is limited to five.

v Example 4: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2
         max_reservations = 0

  This example shows how permission to make reservations can be granted to a group of users except one specific user. Because the max_reservations keyword is set to zero in the user stanza for Rich, he does not have permission to make any reservation, even though all other users in group2 (Dave and Steve) can make reservations.

v Example 5: Assume the administration file contains the following stanzas:
     default: type = group
         max_reservations = 0

     default: type = user
         max_reservations = 0

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2
         max_reservations = 5

     dave: type = user
         max_reservations = 2

  This example shows how permission to make reservations can be granted to specific user and group pairs. Because the max_reservations keyword is set to zero in both the default user and group stanzas, no one has permission to make any reservation unless they are specifically granted permission through both the user and group stanza. In this example:
  – User Rich can own at any time up to five reservations in group2 only.
  – User Dave can own at any time up to two reservations in group2 only.
  The total number of reservations they can own at any given time is limited to five. No other combination of user or group pairs can make any reservations.

v Example 6: Assume the administration file contains the following stanzas:

     default: type = user
         max_reservations = 1

  This example permits any user to make one reservation in any group, until the number of reservations reaches the maximum number allowed in the LoadLeveler cluster.

v Example 7: Assume the administration file contains the following stanzas:

     default: type = group
         max_reservations = 0

     default: type = user
         max_reservations = 0

     group1: type = group
         max_reservations = 6
         max_reservation_duration = 1440

     carol: type = user
         default_group = group1
         max_reservations = 4
         max_reservation_duration = 720

     dave: type = user
         default_group = group1
         max_reservations = 4
         max_reservation_duration = 2880

  In this example, two users, Carol and Dave, are members of group1. Neither Carol nor Dave belongs to any other group with a group stanza in the LoadLeveler administration file, although they may use any string as the name of a LoadLeveler group and belong to it by default. Because the max_reservations keyword is set to zero in the default group stanza, reservations can be made only in group1, which has an allotment of six reservations. Each reservation can have a maximum duration of 1440 minutes (24 hours).
  Considering only the user-stanza attributes for reservations:
  – User Carol can make up to four reservations, each having a maximum duration of 720 minutes (12 hours).
  – User Dave can make up to four reservations, each having a maximum duration of 2880 minutes (48 hours).

  If there are no reservations in the system and user Carol wants to make four reservations, she may do so. Each reservation can have a maximum duration of no more than 720 minutes. If Carol attempts to make a reservation with a duration greater than 720 minutes, LoadLeveler will not make the reservation because it exceeds the duration allowed for Carol.

  Assume that Carol has created four reservations, and user Dave now wants to create four reservations:
  – The number of reservations Dave may make is limited by the state of Carol's reservations and the maximum limit on reservations for group1. If the four reservations Carol made are still being set up, or are active, active shared, or waiting, LoadLeveler will restrict Dave to making only two reservations at this time.
  – Because the value of max_reservation_duration for the group is more restrictive than max_reservation_duration for user Dave, LoadLeveler enforces the group value, 1440 minutes.

  If Dave belonged to another group that still had reservations available, then he could make reservations under that group, assuming the maximum number of reservations for the cluster had not been met. However, in this example, Dave cannot make any further reservations because they are allowed in group1 only.

Steps for integrating LoadLeveler with the AIX Workload Manager

Another administrative setup task you must consider is whether you want to enforce resource usage of ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory.

If you want to control these resources, AIX Workload Manager (WLM) can be integrated with LoadLeveler to balance workloads at the machine level. When you are using WLM, workload balancing is done by assigning relative priorities to job processes. These job priorities prevent one job from monopolizing system resources when that resource is under contention.

Note: WLM is not supported in LoadLeveler for Linux.

To integrate LoadLeveler and WLM, perform the following steps:
1. As required for your use, define the applicable options for ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, or ConsumableLargePageMemory as consumable resources in the SCHEDULE_BY_RESOURCES global configuration keyword. This enables the LoadLeveler scheduler to consider these consumable resources.
2. As required for your use, define the applicable options for ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, or ConsumableLargePageMemory in the ENFORCE_RESOURCE_USAGE global configuration keyword. This enables enforcement of these consumable resources by AIX WLM.
3. Define hard, soft, or shares in the ENFORCE_RESOURCE_POLICY configuration keyword. This defines the policy that LoadLeveler uses for CPUs and real memory when setting WLM class resource entitlements.
4. (Optional) Set the ENFORCE_RESOURCE_MEMORY configuration keyword to true. This setting allows AIX WLM to limit the real memory usage of a WLM class as precisely as possible. When a class exceeds its limit, all processes in the class are killed.
   Rule: ConsumableMemory must be defined in the ENFORCE_RESOURCE_USAGE keyword in the global configuration file, or LoadLeveler does not consider the ENFORCE_RESOURCE_MEMORY keyword to be valid.
   Tips:
   v When set to true, the ENFORCE_RESOURCE_MEMORY keyword overrides the policy set through the ENFORCE_RESOURCE_POLICY keyword for ConsumableMemory only. The ENFORCE_RESOURCE_POLICY keyword value still applies for ConsumableCpus.
   v ENFORCE_RESOURCE_MEMORY may be set in either the global or the local configuration file. In the global configuration file, this keyword sets the default value for all the machines in the LoadLeveler cluster. If the keyword also is defined in a local file, the local setting overrides the global setting.
5. Using the resources keyword in a machine stanza in the administration file, define the CPU, real memory, virtual memory, and large page machine resources available for user jobs.
   v The ConsumableCpus reserved word accepts a count value of "all". This indicates that the initial resource count will be obtained from the Startd machine update value for CPUs.
   v If no resources are defined for a machine, then no enforcement will be done on that machine.
   v If the count specified by the administrator is greater than what the Startd update indicates, the initial count value will be reduced to match what the Startd reports.
   v For CPUs and real memory, if the count specified by the administrator is less than what the Startd update indicates, the WLM resource shares assigned to a job will be adjusted to represent that difference. In addition, a WLM softlimit will be defined for each WLM class. For example, if the administrator defines 8 CPUs on a 16-CPU machine, then a job requesting 4 CPUs will get a share of 4 and a softlimit of 50%.
   v Use caution when determining the amount of real memory available for user jobs. A certain percentage of a machine's real memory will be dedicated to the Default and System WLM classes and will not be included in the calculation of real memory available for user jobs. Start LoadLeveler with the ENFORCE_RESOURCE_USAGE keyword enabled and issue wlmstat -v -m. Look at the npg column to determine how much memory is being used by these classes.
   v ConsumableVirtualMemory and ConsumableLargePageMemory are hard max limit values.
     – AIX WLM considers the ConsumableVirtualMemory value to be real memory plus large page plus swap space.
     – The ConsumableLargePageMemory value should be equal to a multiple of the page size. For example, 16MB (page size) * 4 pages = 64MB.
6. Decide whether all jobs should have their CPU, real memory, virtual memory, or large page resources enforced, and then define the ENFORCE_RESOURCE_SUBMISSION global configuration keyword.
   v If the value specified is true, LoadLeveler will check all jobs at submission time for the resources and node_resources keywords. To be submitted, either the job's resources or node_resources keyword must specify the same resources as the ENFORCE_RESOURCE_USAGE keyword.
   v If the value specified is false, no checking is performed. Jobs submitted without the resources or node_resources keyword will not have their resources enforced, and they might interfere with other jobs whose resources are enforced.
   v To support existing job command files without the resources or node_resources keyword, the default_resources and default_node_resources keywords in the class stanza can be defined.

For more information on the ENFORCE_RESOURCE_USAGE and the ENFORCE_RESOURCE_SUBMISSION keywords, see "Defining usage policies for consumable resources" on page 60.
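As a sketch of how these settings fit together, the global configuration file entries below enforce CPU and real memory usage, and the machine stanza publishes the resources of one node. The machine name and memory amount are hypothetical, and the exact units and separators accepted by the resources keyword should be verified in the keyword references:

   SCHEDULE_BY_RESOURCES       = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_USAGE      = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_POLICY     = shares
   ENFORCE_RESOURCE_SUBMISSION = true

   node01: type = machine
       resources = ConsumableCpus(all) ConsumableMemory(14000 mb)

With ENFORCE_RESOURCE_SUBMISSION set to true, a job would then need, for example, resources = ConsumableCpus(4) ConsumableMemory(2000 mb) in its job command file, or a default_resources value in its class stanza, in order to be accepted.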
LoadLeveler support for checkpointing jobs

Checkpointing is a method of periodically saving the state of a job step so that if the step does not complete it can be restarted from the saved state. When checkpointing is enabled, checkpoints can be initiated from within the application at major milestones, or by the user, administrator, or LoadLeveler external to the application. Both serial and parallel job steps can be checkpointed.

Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job and setting the restart_from_ckpt = yes job command file keyword. When a job is terminated and remains on the LoadLeveler job queue, such as when a job step is vacated, the job step will automatically be restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing checkpoint and hold, stopping or recycling LoadLeveler, or as the result of a node crash.

To find out more about checkpointing jobs, use the information in Table 30.

Table 30. Roadmap of tasks for checkpointing jobs

v Preparing the LoadLeveler environment for checkpointing and restarting jobs:
  – "Checkpoint keyword summary"
  – "Planning considerations for checkpointing jobs" on page 140
  – "AIX checkpoint and restart limitations" on page 141
  – "Naming checkpoint files and directories" on page 145
v Checkpointing and restarting jobs:
  – "Checkpointing a job" on page 232
  – "Removing old checkpoint files" on page 146
v Correctly specifying configuration and administration file keywords:
  – Chapter 12, "Configuration file reference," on page 263
  – Chapter 13, "Administration file reference," on page 321

Checkpoint keyword summary

The following is a summary of the keywords associated with the checkpoint and restart function:
v Configuration file keywords:
  – CKPT_CLEANUP_INTERVAL
  – CKPT_CLEANUP_PROGRAM
  – CKPT_EXECUTE_DIR
  – MAX_CKPT_INTERVAL
  – MIN_CKPT_INTERVAL
  For more information about these keywords, see Chapter 12, "Configuration file reference," on page 263.
v Administration file keywords:
  – ckpt_dir
  – ckpt_time_limit
  For more information about these keywords, see Chapter 13, "Administration file reference," on page 321.
v Job command file keywords:
  – checkpoint
  – ckpt_dir
  – ckpt_execute_dir
  – ckpt_file
  – ckpt_time_limit
  – restart_from_ckpt
  For more information about these keywords, see "Job command file keyword descriptions" on page 359.
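For orientation, a minimal job command file that enables checkpointing might look like the following sketch. The executable path, checkpoint directory, and time-limit format shown are illustrative assumptions; see "Job command file keyword descriptions" on page 359 for the values the checkpoint and ckpt_time_limit keywords actually accept:

   # @ job_type        = serial
   # @ executable      = /u/user1/myapp
   # @ checkpoint      = yes
   # @ ckpt_dir        = /scratch/ckpt
   # @ ckpt_file       = myapp.ckpt
   # @ ckpt_time_limit = 30:00
   # @ queue

To restart a job that has already left the queue from its saved state, a new job would be submitted with restart_from_ckpt = yes and the same ckpt_dir and ckpt_file values.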
Planning considerations for checkpointing jobs

Review the following guidelines before you submit a checkpointing job:
v Plan for jobs that you will restart on different nodes. If you plan to migrate jobs (restart jobs on a different node or set of nodes), you should understand the difference between writing checkpoint files to a local file system and writing them to a global file system (such as AFS or GPFS™). The ckpt_file and ckpt_dir keywords in the job command and configuration files allow you to write to either type of file system. If you are using a local file system, before restarting the job from checkpoint, make certain that the checkpoint files are accessible from the machine on which the job will be restarted.
v Reserve adequate disk space. A checkpoint file requires a significant amount of disk space. The checkpoint will fail if the directory where the checkpoint file is written does not have adequate space. For serial jobs, one checkpoint file will be created. For parallel jobs, one checkpoint file will be created for each task. Since the old set of checkpoint files is not deleted until the new set of files is successfully created, the checkpoint directory should be large enough to contain two sets of checkpoint files. You can make an accurate size estimate only after you have run your job and noticed the size of the checkpoint file that is created.
v Plan for staging executables. If you want to stage the executable for a job step, use the ckpt_execute_dir keyword to define the directory where LoadLeveler will save the executable. This directory cannot be the same as the current location of the executable file, or LoadLeveler will not stage the executable. You may define the ckpt_execute_dir keyword in either the configuration file or the job command file. To decide where to define the keyword, use the information in Table 31.

Table 31. Deciding where to define the directory for staging executables

If the ckpt_execute_dir keyword is defined in the configuration file only:
v LoadLeveler stages the executable file in a new subdirectory of the specified directory. The name of the subdirectory is the job step ID.
v The user is the owner of the subdirectory and has permission 700.
v If the user issues the llckpt command with the -k option, LoadLeveler deletes the staged executable.
v LoadLeveler will delete the subdirectory and the staged executable when the job step ends.

If the ckpt_execute_dir keyword is defined in the job command file only, or in both the configuration and job command files:
v LoadLeveler stages the executable file in the directory specified in the job command file.
v The user is the owner of the file and has execute permission for it.
v The user is responsible for deleting the staged file after the job step ends.

If the ckpt_execute_dir keyword is defined in neither file, LoadLeveler does not stage the executable file for the job step.

v Set your checkpoint file size to the maximum. To make sure that your job can write a large checkpoint file, assign your job to a job class that has its file size limit set to the maximum (unlimited). In the administration file, set up a class stanza for checkpointing jobs with the following entry:

     file_limit = unlimited,unlimited

  This statement specifies that there is no limit on the maximum size of a file that your program can create.
v Choose a unique checkpoint file name. To prevent another job step from writing over your checkpoint file with another checkpoint file, make certain that your checkpoint file name is unique. The ckpt_dir and ckpt_file keywords give you control over the location and name of these files. For more information, see "Naming checkpoint files and directories" on page 145.

AIX checkpoint and restart limitations

There are limitations associated with checkpoint and restart:
v The following items cannot be checkpointed:
  – Programs that are being run under:
    - The dynamic probe class library (DPCL).
    - Any debugger.
  – MPI programs that are not compiled with mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, or mpxlf95_r.
  – Processes that use:
    - Extended shmat support
    - Pinned shared memory segments
    - The debug malloc tool (MALLOCTYPE=debug)
  – Sets of processes in which any process is running a setuid program when a checkpoint occurs.
  – Sets of processes in which any process is running a setgid program when a checkpoint occurs.
  – Interactive parallel jobs for which POE input or output is a pipe.
  – Interactive parallel jobs for which POE input or output is redirected, unless the job is submitted from a shell that had the CHECKPOINT environment variable set to yes before the shell was started. If POE is run from inside a shell script and is run in the background, the script must be started from a shell started in the same manner for the job to be checkpointable.
  – Interactive POE jobs for which the su command was used prior to checkpointing or restarting the job.
v The node on which a process is restarted must have:
  – The same operating system level (including PTFs). In addition, a restarted process may not load a module that requires a system call from a kernel extension that was not present at checkpoint time.
  – The same switch type as the node where the checkpoint occurred.
  If any threads in a process were bound to a specific processor ID at checkpoint time, that processor ID must exist on the node where that process is restarted.
v If the LoadLeveler cluster contains nodes running a mix of 32-bit and 64-bit kernels, then applications must be checkpointed and restarted on the same set of nodes. For more information, see "llckpt - Checkpoint a running job step" on page 430 and the restart_on_same_nodes keyword description.
v For a parallel job, the number of tasks and the task geometry (the tasks that are common within a node) must be the same on a restart as it was when the job was checkpointed.
v Any regular file open in a process when it is checkpointed must be present on the node where that process is restarted, including the executable and any dynamically loaded libraries or objects.
v If any process uses sockets or pipes, user callbacks should be registered to save data that may be "in flight" when a checkpoint occurs, and to restore the data when the process is resumed after a checkpoint or restart. Similarly, any user shared memory in a parallel task should be saved and restored.
v A checkpoint operation will not begin on a process until each user thread in that process has released all pthread locks, if held. This can potentially cause a significant delay from the time a checkpoint is issued until the checkpoint actually occurs. Also, any thread of a process that is being checkpointed that does not hold any pthread locks and tries to acquire one will be stopped immediately. There are no similar actions performed for atomic locks (_check_lock and _clear_lock, for example).
v Atomic locks must be used in such a way that they do not prevent the releasing of pthread locks during a checkpoint. For example, if a checkpoint occurs and thread 1 holds a pthread lock and is waiting for an atomic lock, and thread 2 tries to acquire a different pthread lock (and does not hold any other pthread locks) before releasing the atomic lock that is being waited for in thread 1, the checkpoint will hang.
v A process must not hold a pthread lock when creating a new process (either implicitly using popen, for example, or explicitly using fork) if releasing the lock is contingent on some action of the new process. Otherwise, a checkpoint could occur that would cause the child process to be stopped before the parent could release the pthread lock, causing the checkpoint operation to hang.
v The checkpoint operation will hang if any user pthread locks are held across:
  – Any collective communication calls in MPI or LAPI
  – Calls to mpc_init_ckpt or mp_init_ckpt
v Processes cannot be profiled at the time a checkpoint is taken.
v There can be no devices other than TTYs or /dev/null open at the time a checkpoint is taken.
v Open files must either have an absolute path name that is less than or equal to PATHMAX in length, or must have a relative path name that is less than or equal to PATHMAX in length from the current directory at the time they were opened. The current directory must have an absolute path name that is less than or equal to PATHMAX in length.
v Semaphores or message queues that are used within the set of processes being checkpointed must only be used by processes within the set of processes being checkpointed. This condition is not verified when a set of processes is checkpointed. The checkpoint and restart operations will succeed, but inconsistent results can occur after the restart.
v The processes that create shared memory must be checkpointed with the processes using the shared memory if the shared memory is ever detached from all processes being checkpointed. Otherwise, the shared memory may not be available after a restart operation.
v The ability to checkpoint and restart a process is not supported for B1 and C2 security configurations.
v A process can only checkpoint another process if it can send a signal to the process. In other words, the privilege checking for checkpointing processes is identical to the privilege checking for sending a signal to the process. A privileged process (the effective user ID is 0) can checkpoint any process. A set of processes can only be checkpointed if each process in the set can be checkpointed.
v A process can only restart another process if it can change its entire privilege state (real, saved, and effective versions of user ID, group ID, and group list) to match that of the restarted process. A set of processes can only be restarted if each process in the set can be restarted.
v The only DCE function supported is DCE credential forwarding by LoadLeveler using the DCE_AUTHENTICATION_PAIR configuration keyword. DCE credential forwarding is for the sole purpose of DFS™ access by the application.
v If a process invokes any Network Information Service (NIS) functions, from then on, AIX will delay the start of a checkpoint of the process until the process returns from any system calls.
v Jobs in which the message passing application is not a direct child of the Partition Manager Daemon (pmd) cannot be checkpointed.
v Scale-across jobs cannot be checkpointed.
v The following functions will return ENOTSUP if called in a job that has enabled checkpointing:
  – clock_getcpuclockid()
  – clock_getres()
  – clock_gettime()
  – clock_nanosleep()
  – clock_settime()
  – mlock()
  – mlockall()
  – mq_close()
  – mq_getattr()
  – mq_notify()
  – mq_open()
  – mq_receive()
  – mq_send()
  – mq_setattr()
  – mq_timedreceive()
  – mq_timedsend()
  – mq_unlink()
  – munlock()
  – munlockall()
  – nanosleep()
  – pthread_barrier_destroy()
  – pthread_barrier_init()
  – pthread_barrier_wait()
  – pthread_barrierattr_destroy()
  – pthread_barrierattr_getpshared()
  – pthread_barrierattr_init()
  – pthread_barrierattr_setpshared()
  – pthread_condattr_getclock()
  – pthread_condattr_setclock()
  – pthread_getcpuclockid()
  – pthread_mutex_getprioceiling()
  – pthread_mutex_setprioceiling()
  – pthread_mutex_timedlock()
  – pthread_mutexattr_getprioceiling()
  – pthread_mutexattr_getprotocol()
  – pthread_mutexattr_setprioceiling()
  – pthread_mutexattr_setprotocol()
  – pthread_rwlock_timedrdlock()
  – pthread_rwlock_timedwrlock()
  – pthread_setschedprio()
  – pthread_spin_destroy()
  – pthread_spin_init()
  – pthread_spin_lock()
  – pthread_spin_trylock()
  – pthread_spin_unlock()
  – sched_get_priority_max()
  – sched_get_priority_min()
  – sched_getparam()
  – sched_getscheduler()
  – sched_rr_get_interval()
  – sched_setparam()
  – sched_setscheduler()
  – sem_close()
  – sem_destroy()
  – sem_getvalue()
  – sem_init()
  – sem_open()
  – sem_post()
  – sem_timedwait()
  – sem_trywait()
  – sem_unlink()
  – sem_wait()
  – shm_open()
  – shm_unlink()
  – timer_create()
  – timer_delete()
  – timer_getoverrun()
  – timer_gettime()
  – timer_settime()
Naming checkpoint files and directories

At checkpoint time, a checkpoint file and potentially an error file will be created. For jobs that are enabled for checkpoint, a control file may be generated at the time of job submission. The directory that will contain these files must already exist and have sufficient space and permissions for these files to be written.

The name and location of these files are controlled through keywords in the job command file or the LoadLeveler configuration. The file name specified is used as a base name from which the actual checkpoint file name is constructed. To prevent another job step from writing over your checkpoint file, make certain that your checkpoint file name is unique. For serial jobs and the master task (POE) of parallel jobs, the checkpoint file name will be <basename>.Tag. For a parallel job, a checkpoint file is created for each task; the checkpoint file name will be <basename>.Taskid.Tag. The tag is used to differentiate between a current and a previous checkpoint file.

A control file may be created in the checkpoint directory. This control file contains information LoadLeveler uses for restarting certain jobs. An error file may also be created in the checkpoint directory. The data in this file is in a machine-readable format; the information contained in the error file is available in mail, in the LoadLeveler logs, or in the output of the checkpoint command. Both of these files are named with the same base name as the checkpoint file, with the extensions .cntl and .err, respectively.

Naming checkpoint files for serial and batch parallel jobs

The following describes the order in which keywords are checked to construct the full path name for a serial or batch checkpoint file:
v Base name for the checkpoint file name:
  1. The ckpt_file keyword in the job command file
  2. The default file name [<jobname.>]<job_step_id>.ckpt, where:
     jobname
         The job_name specified in the job command file. If job_name is not specified, it is omitted from the default file name.
     job_step_id
         Identifies the job step that is being checkpointed.
v Checkpoint directory name:
  1. The ckpt_file keyword in the job command file, if it contains a "/" as the first character
  2. The ckpt_dir keyword in the job command file
  3. The ckpt_dir keyword specified in the class stanza of the LoadLeveler administration file
  4. The default directory, which is the initial working directory

Note that two or more job steps running at the same time cannot both write to the same checkpoint file, since the file would be corrupted.
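As a worked example of these rules, suppose a hypothetical serial job command file specifies:

   # @ job_name  = sim
   # @ ckpt_dir  = /scratch/ckpt
   # @ ckpt_file = sim.ckpt

Because sim.ckpt does not begin with a "/", it is used as the base name and the directory comes from ckpt_dir, so checkpoints are written as /scratch/ckpt/sim.ckpt.<Tag> (or /scratch/ckpt/sim.ckpt.<Taskid>.<Tag> per task for a parallel job), with any control and error files written as /scratch/ckpt/sim.ckpt.cntl and /scratch/ckpt/sim.ckpt.err. If ckpt_file were omitted, the default base name would instead be sim.<job_step_id>.ckpt. The job name and paths here are hypothetical.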
Naming checkpoint files for interactive parallel jobs

The following describes the order in which keywords and variables are checked to construct the full path name for the checkpoint file for an interactive parallel job:
v Checkpoint file name:
  1. The value of the MP_CKPTFILE environment variable within the POE process
  2. The default file name, poe.ckpt.<pid>
v Checkpoint directory name:
  1. The value of the MP_CKPTFILE environment variable within the POE process, if it contains a full path name
  2. The value of the MP_CKPTDIR environment variable within the POE process
  3. The initial working directory

Note: The keywords ckpt_dir and ckpt_file are not allowed in the command file for an interactive session. If they are present, they will be ignored and the job will be submitted.

Removing old checkpoint files

To keep your system free of checkpoint files that are no longer necessary, LoadLeveler provides two keywords to help automate the process of removing these files:
v CKPT_CLEANUP_PROGRAM
v CKPT_CLEANUP_INTERVAL

Both keywords must contain valid values to automate this process. For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.
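A minimal configuration file sketch for automated cleanup follows. The program path is hypothetical and the interval value is illustrative; see Chapter 12 for the requirements the cleanup program must satisfy and for the units of the interval value:

   CKPT_CLEANUP_PROGRAM  = /usr/local/sbin/rm_old_ckpt
   CKPT_CLEANUP_INTERVAL = 3600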
LoadLeveler scheduling affinity support

LoadLeveler offers the following scheduling affinity options:
v Memory and adapter affinity
v Processor affinity

Enabling scheduling affinity allows LoadLeveler jobs to benefit from performance improvements based on multiple chip module (MCM) affinities (memory and adapter) and processor affinities. If enabled, LoadLeveler will schedule and attach the appropriate CPUs in the cluster to the job tasks in order to maximize the performance improvement, based on the type of affinity requested by the job.

Memory and adapter affinity

Memory affinity is a special-purpose option for improving performance on IBM POWER6™, POWER5™, and POWER4™ processor-based systems. These machines contain MCMs, each containing multiple processors. System memory is attached to these MCMs. While any processor can access all of the memory in the system, a processor has faster access and higher bandwidth when addressing memory that is attached to its own MCM rather than memory attached to the other MCMs in the system.

The concept of affinity also applies to the I/O subsystem. The processes running on CPUs from an MCM have faster access to the adapters attached to the I/O slots of that MCM. I/O affinity is referred to as adapter affinity in this topic. For more information about memory and adapter affinity, see AIX Performance Management Guide.

Processor affinity

LoadLeveler provides processor affinity options to improve job performance on the following platforms:
v IBM POWER6 and POWER5 processor-based systems running in simultaneous multithreading (SMT) mode with AIX or Linux
v IBM POWER6 and POWER5 processor-based systems running in Single Threaded (ST) mode with AIX or Linux
v IBM POWER4 processor-based systems with AIX or Linux
v x86 and x86_64 processor-based systems with Linux

On AIX, affinity support is implemented by using a Resource Set (RSet), which contains bit maps for CPU and memory pool resources. The RSet APIs available in AIX can be used to attach RSets to processes. Attaching an RSet to a process limits the process to using only the resources contained in the RSet. One of the main uses of RSets is to limit the application processes to run only on the processors contained in a single MCM and hence to benefit from memory affinity. For more details on RSets, refer to AIX System Management Guide: Operating System and Devices.

On Linux on Power systems, affinity support is implemented by using "cpusets," which provide a mechanism for assigning a set of CPUs and memory nodes (MCMs) to a set of tasks. The cpusets constrain the CPU and memory placement of tasks to only the resources within a task's current cpuset. The cpusets are managed by the virtual file system type cpuset. Before configuring LoadLeveler to support affinity, the cpuset virtual file system must be created on every machine in the cluster to enable affinity support.

On Linux on x86 and x86_64 systems, affinity support is implemented by using the sched_setaffinity Linux-specific system call to assign a set of physical or logical CPUs to the job processes.

Configuring LoadLeveler to use scheduling affinity

On AIX and Linux on Power systems, scheduling affinity can be enabled by using the RSET_SUPPORT configuration file keyword. Machines that are configured with this keyword indicate the ability to service jobs requesting or requiring scheduling affinity.

Enable RSET_SUPPORT with one of these values:
v Choose RSET_MCM_AFFINITY to allow jobs specifying rset = RSET_MCM_AFFINITY or the task_affinity keyword to run on a node. When rset = RSET_MCM_AFFINITY, LoadLeveler will select and attach sets of CPUs to task processes such that a set of CPUs will be from the same MCM. When the task_affinity keyword is used, LoadLeveler will select CPUs regardless of their location with respect to an MCM.
v Choose RSET_USER_DEFINED to allow jobs specifying a user-defined RSet name for rset to run on a node. The RSET_USER_DEFINED option enables scheduling affinity, allowing users more control over scheduling affinity parameters by allowing the use of user-defined RSets. Through the use of user-defined RSets, users can utilize new RSet features before a LoadLeveler
implementation is released. This option also allows users to specify a different number of CPUs in their RSets depending on the needs of each task. This value is supported only on AIX machines.

Note:
1. Because LoadLeveler creates a cpuset for each task requesting affinity under the /dev/cpuset directory on Linux on POWER machines, the cpuset virtual file system must be created and mounted on the /dev/cpuset directory by issuing the following commands on each node:

   # mkdir /dev/cpuset
   # mount -t cpuset none /dev/cpuset

2. A virtual file system of type cpuset mounted at /dev/cpuset will be deleted when the node is rebooted. To create the /dev/cpuset directory and have the virtual cpuset file system mounted on it automatically when the node is rebooted, add the following commands to your startup script (for example, /etc/init.d/boot.local), which is run when the node is rebooted or started:

   if test -e /dev/cpuset || mkdir -p /dev/cpuset ; then
       mount -t cpuset none /dev/cpuset
   fi

See "Configuration file keyword descriptions" on page 265 for more information on the RSET_SUPPORT keyword.

On AIX and Linux on Power systems, jobs requesting processor affinity with the task_affinity keyword in the job command file will run only on machines where the resources statement in the machine stanza in the LoadLeveler administration file contains the ConsumableCpus keyword. For more information on specifying ConsumableCpus, see the resources keyword description in "Administration file keyword descriptions" on page 327.

On Linux on x86 and x86_64 systems, exclusive allocation of CPUs to job steps is enabled by using the ALLOC_EXCLUSIVE_CPU_PER_JOB configuration file keyword. Enable ALLOC_EXCLUSIVE_CPU_PER_JOB with one of these values:
v Choose the PHYSICAL option to allow LoadLeveler to assign tasks to physical processor packages. The PHYSICAL option allows LoadLeveler to treat hyperthreaded processors and multicore processors as a single unit so that a job has dedicated computing resources. For example, a node with two Intel x86 processors with hyperthreading turned ON will be treated as a node with two physical processors. Similarly, a node with two dual-core AMD Opteron processors will be treated as a node with two physical processors.
v Choose the LOGICAL option to allow LoadLeveler to assign tasks to processor units. For example, a node with two Intel x86 processors with hyperthreading turned ON will be treated as a node with four processors. A node with two dual-core AMD Opteron processors will be treated as a node with four processors.

See "Configuration file keyword descriptions" on page 265 for more information on the ALLOC_EXCLUSIVE_CPU_PER_JOB keyword.
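Pulling these pieces together, a minimal sketch might enable MCM affinity in the configuration file, publish CPUs in the machine stanza, and request affinity in the job command file. The machine name is hypothetical, and the task_affinity value shown is an assumption; verify its accepted forms in the job command file keyword descriptions. In the configuration file:

   RSET_SUPPORT = RSET_MCM_AFFINITY

In the administration file:

   node01: type = machine
       resources = ConsumableCpus(all)

In the job command file:

   # @ rset = RSET_MCM_AFFINITY

or, to request processor affinity instead of MCM placement:

   # @ task_affinity = core(1)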
LoadLeveler multicluster support

To provide a more scalable runtime environment and more efficient workload balancing, you may configure a LoadLeveler multicluster environment.

A LoadLeveler multicluster environment consists of two or more LoadLeveler clusters, grouped together through network connections that allow the clusters to share resources. These clusters may be AIX, Linux, or mixed clusters.

Within a LoadLeveler multicluster environment:
v The local cluster is the cluster from which the user submits jobs or issues commands.
v A remote cluster is a cluster that accepts job submissions and commands from the local cluster.
v A local gateway Schedd is a Schedd within the local cluster serving as an inbound point from some remote cluster, an outbound point to some remote cluster, or both.
v A remote gateway Schedd is a Schedd within a remote cluster serving as an inbound point from the local cluster, an outbound point to the local cluster, or both.
v A local central manager is the central manager in the same cluster as the local gateway Schedd.
v A remote central manager is the central manager in the same cluster as a remote gateway Schedd.

A LoadLeveler multicluster environment addresses scalability and workload balancing issues by providing the ability to:
v Distribute workload among LoadLeveler clusters when jobs are submitted.
v Easily access multiple LoadLeveler cluster resources.
v Display information about the multicluster.
v Monitor and control operations in a multicluster.
v Transfer idle jobs from one cluster to another.
v Transfer user input and output files between clusters.
v Enable LoadLeveler to operate in a secure environment where clusters are separated by a firewall.

Table 32 shows the multicluster support subtasks with a pointer to the associated instructions:

Table 32. Multicluster support subtasks and associated instructions

v Configure a LoadLeveler multicluster: "Configuring a LoadLeveler multicluster" on page 150
v Submit and monitor jobs in a LoadLeveler multicluster: "Submitting and monitoring jobs in a LoadLeveler multicluster" on page 223
v Scale-across scheduling: "Scale-across scheduling with multiclusters" on page 153

Table 33. Multicluster support related topics

v Administration file: Cluster stanzas — "Defining clusters" on page 100
v Administration file: Cluster keywords — "Administration file keyword descriptions" on page 327
v Configuration file: Cluster keywords — "Configuration file keyword descriptions" on page 265
v Job command file: Cluster keywords — "Job command file keyword descriptions" on page 359
Table 33. Multicluster support related topics (continued)

v Commands and APIs — Chapter 16, "Commands," on page 411, or Chapter 17, "Application programming interfaces (APIs)," on page 541
v Diagnosis and messages — TWS LoadLeveler: Diagnosis and Messages Guide

Configuring a LoadLeveler multicluster

Table 34 lists the subtasks for configuring a LoadLeveler multicluster.

Table 34. Subtasks for configuring a LoadLeveler multicluster

v Configure the LoadLeveler multicluster environment:
  – "Steps for configuring a LoadLeveler multicluster" on page 151
  – "Steps for securing communications within a LoadLeveler multicluster" on page 153
v Display information about the LoadLeveler multicluster environment:
  – Use the llstatus command:
    - With the -X option to display information about machines in the multicluster.
    - With the -C option to display information defined in cluster stanzas in the administration file.
  – Use the llclass command with the -X option to display information about classes on any cluster (local or remote).
  – Use the llq command with the -X option to display information about jobs on any cluster (local or remote).
Table 34. Subtasks for configuring a LoadLeveler multicluster (continued)

v Monitor and control operations in the LoadLeveler multicluster environment: Existing LoadLeveler user commands accept the -X option for a multicluster environment.
  Rules:
  – Administrator-only commands are not applicable in a multicluster environment.
  – The options -x, -W, -s, and -p cannot be specified together with the -X option on the llmodify command.
  – The options -x and -w cannot be specified together with the -X option on the llq command.
  – The -X option on the following commands is restricted to a single cluster: llcancel, llckpt, llhold, llmodify, llprio.
  – The following commands are not applicable in a multicluster environment: llacctmrg, llchres, llextRPD, llinit, llmkres, llqres, llrmres, llrunscheduler, llsummary.

Steps for configuring a LoadLeveler multicluster

The primary task for configuring a LoadLeveler multicluster environment is to enable communication between gateway Schedd daemons on all of the clusters in the multicluster. To do so requires defining each Schedd daemon as either local or remote, and defining the inbound and outbound hosts with which the daemon will communicate.

Before you begin: You need to know that:
v A single machine may be defined as an inbound or outbound host, or as both.
v A single cluster must belong to only one multicluster.
v A single multicluster must consist of 10 or fewer clusters.
v Clusters must have unique host names within the multicluster network domain space.
v The inbound Schedd becomes the schedd_host of all remote jobs it receives.

Perform the following steps to configure a LoadLeveler multicluster:
1. In the administration file, define one cluster stanza for each cluster in the LoadLeveler multicluster environment.
   Rules:
   v You must define one cluster as the local cluster.
   v You must code the following required cluster-stanza keywords and variable values:
     cluster_name: type=cluster
         outbound_hosts = hostname[(cluster_name)]
         inbound_hosts = hostname[(cluster_name)]

   v If you want to allow users to submit remote jobs to the local cluster, the list of inbound hosts must include the name of the inbound Schedd and the cluster you are defining as remote, or you must specify the name of an inbound Schedd without any cluster specification so that it defaults to being an inbound Schedd for all clusters.
   v If the configuration file keyword SCHEDD_STREAM_PORT for any cluster is set to use a port other than the default value of 9605, you must set the inbound_schedd_port keyword in the cluster stanza for that cluster.
2. (Optional) If the local cluster wants to provide job distribution, where users allow LoadLeveler to select the appropriate cluster for job submission based on administration-defined objectives, define an installation exit to be executed at submit time using the CLUSTER_METRIC configuration keyword. You can use the LoadLeveler data access APIs in this exit to query other clusters for information about possible metrics, such as the number of jobs in a specified job class, the number of jobs in the idle queue, or the number of free nodes in the cluster. For more detailed information, see CLUSTER_METRIC.
   Tip: LoadLeveler provides a set of sample exits for you to use as models. These samples are in the ${RELEASEDIR}/samples/llcluster directory.
3. (Optional) If the local cluster wants to perform user mapping on jobs arriving from remote clusters, define the CLUSTER_USER_MAPPER configuration keyword. For more information, see CLUSTER_USER_MAPPER.
4. (Optional) If the local cluster wants to perform job filtering on jobs received from remote clusters, define the CLUSTER_REMOTE_JOB_FILTER configuration keyword. For more information, see CLUSTER_REMOTE_JOB_FILTER.
5. Notify LoadLeveler daemons by issuing the llctl command with either the reconfig or recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the administration file.

Additional considerations:
v Remote jobs are subjected to the same configuration checks as locally submitted jobs. Examples include account validation, class limits, include lists, and exclude lists.
v Remote jobs will be processed by the local submit_filter prior to submission to a remote cluster.
v Any tracker program specified in the API parameters will be invoked upon the scheduling cluster nodes.
v If a step is enabled for checkpointing and the ckpt_execute_dir keyword is not specified, LoadLeveler will not copy the executable to the remote cluster; the user must ensure that the executable exists on the remote cluster. If the executable is not in a shared file system, it can be copied to the remote cluster using the cluster_input_file job command file keyword.
v If the job command file is also the executable and the job is submitted or moved to a remote cluster, the $(executable) variable will contain the full path name of the executable on the local cluster from which it came. This differs from the behavior on the local cluster, where the $(executable) variable will be the command-line argument passed to the llsubmit command. If you only want the file name, use the $(base_executable) variable.
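For example, the administration file on each cluster might contain a pair of stanzas similar to the following sketch. The cluster names and host names are hypothetical, and the use of a local keyword to mark the local cluster is an assumption to be verified against the cluster stanza keyword descriptions in Chapter 13:

   cluster_east: type = cluster
       local          = true
       inbound_hosts  = schedd1.east.example.com(cluster_west)
       outbound_hosts = schedd1.east.example.com(cluster_west)

   cluster_west: type = cluster
       inbound_hosts  = schedd9.west.example.com(cluster_east)
       outbound_hosts = schedd9.west.example.com(cluster_east)

On cluster_west, the same two stanzas would appear, with the local cluster marker moved to the cluster_west stanza. After editing the file, llctl -g reconfig (or recycle) applies the changes.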
Steps for securing communications within a LoadLeveler multicluster

Configuring LoadLeveler to use the OpenSSL library enables it to operate in a secure environment where clusters are separated by a firewall.

Perform the following steps to configure LoadLeveler to use OpenSSL in a multicluster environment:
1. Install SSL using the standard platform installation process.
2. Ensure a link exists from the installed SSL library to:
   a. /usr/lib/libssl.so for 32-bit Linux platforms
   b. /usr/lib64/libssl.so for 64-bit Linux platforms
   c. /usr/lib/libssl.a for AIX platforms
3. Create the SSL authorization keys by invoking the llclusterauth command with the -k option on all local gateway Schedds.
   Result: LoadLeveler creates a public key, a private key, and a security certificate for each gateway node.
4. Distribute the public keys to remote gateway Schedds on other secure clusters. This is done by exchanging the public keys with the other clusters you wish to communicate with:
   v For AIX, public keys can be found in the /var/LoadL/ssl/id_rsa.pub file.
   v For Linux, public keys can be found in the /var/opt/LoadL/ssl/id_rsa.pub file.
5. Copy the public keys of the clusters you wish to communicate with into the authorized_keys directory on your inbound Schedd nodes:
   v For AIX, /var/LoadL/ssl/authorized_keys
   v For Linux, /var/opt/LoadL/ssl/authorized_keys
   The authorization key files can be named anything within the authorized_keys directory.
6. Define the cluster stanzas within the LoadLeveler administration file, using the multicluster_security = SSL keyword. Define the ssl_cipher_list keyword if a specific OpenSSL cipher encryption method is desired. Use secure_schedd_port to define the port number to be used for secure inbound transactions to the cluster.
7. Notify LoadLeveler daemons by issuing the llctl -g command with the recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the administration file.
8. Configure firewalls to accept connections to the secure_schedd_port numbers you defined in the administration file.
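As an illustration, securing the link to one remote cluster might involve steps like the following on an AIX gateway node. The remote host name, destination file name, and port number are hypothetical, and scp is shown only as one possible transport for the key exchange:

   llclusterauth -k
   scp /var/LoadL/ssl/id_rsa.pub westgw:/var/LoadL/ssl/authorized_keys/east_key

with a cluster stanza along these lines in the administration file:

   cluster_west: type = cluster
       multicluster_security = SSL
       secure_schedd_port    = 9607

followed by llctl -g recycle, and a firewall rule admitting connections to port 9607.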
Scale-across scheduling with multiclusters

In the multicluster environment, scale-across scheduling allows you to schedule jobs across more than one cluster. Large jobs that request more resources than a single cluster can provide can thus combine resources from more than one cluster and run on the combined resources, effectively spanning more than one cluster.

By spanning resources across more than one cluster, scale-across scheduling also allows utilization of fragmented resources from more than one cluster. Fragmented resources occur when the resources available on a single cluster cannot satisfy any single job on that cluster. Scale-across scheduling allows a job of any size to take advantage of these resources by combining them from multiple clusters.

The following are not supported with scale-across scheduling:
v Checkpointing jobs
v Coscheduled jobs
v Data staging jobs
v Hostlist jobs
v IBM Blue Gene Systems resources jobs
v Interactive Parallel Operating Environment (POE)
v Multistep jobs
v Preemption of scale-across jobs
v Reservations
v Secure Sockets Layer (SSL)
v Task-geometry jobs
v User space jobs

Requirements for scale-across scheduling

Main cluster
    In a multicluster environment that supports scale-across scheduling, one of the clusters in the multicluster environment must be designated as the "main cluster." The main cluster will only schedule scale-across jobs; it will not run any jobs. Scale-across jobs will run on non-main clusters.

Network connectivity
    Any node in one cluster must be able to communicate with any other node in any other cluster that is part of the scale-across configuration. There are two reasons for this requirement:
    v Since the main cluster initiates the scale-across job, one node in the main cluster must have connectivity to any node in any of the other clusters where the job will run.
    v Tasks of parallel applications must communicate with other tasks running on different nodes.

Configuring LoadLeveler for scale-across scheduling

After you choose a set of clusters to participate in scale-across scheduling, you must designate one cluster as the main cluster. Do so by specifying a value of true in the main_scale_across_cluster keyword for that cluster's stanza in the administration files of all scale-across clusters. The cluster that specifies this keyword as true for its own cluster stanza becomes the main cluster. Any cluster that specifies this keyword as true for another cluster stanza becomes a non-main cluster.

Table 35 lists the scale-across scheduling keywords:

Table 35. Keywords for configuring scale-across scheduling

v Administration file keywords:
  – allow_scale_across_jobs cluster stanza keyword
  – main_scale_across_cluster cluster stanza keyword
  – allow_scale_across_jobs class stanza keyword
v Configuration file keyword:
  – SCALE_ACROSS_SCHEDULING_TIMEOUT keyword
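For example, with a main cluster named cluster_main and one non-main cluster named cluster_a, the administration files on all participating clusters might carry stanzas like the following sketch. The cluster names and the allow_scale_across_jobs setting are illustrative, and other required cluster keywords, such as the inbound and outbound hosts, are omitted here for brevity:

   cluster_main: type = cluster
       main_scale_across_cluster = true

   cluster_a: type = cluster
       allow_scale_across_jobs = true

Because the cluster_main stanza sets main_scale_across_cluster to true for itself, cluster_main becomes the main cluster and only schedules scale-across jobs; cluster_a, which sees the keyword set to true in another cluster's stanza, participates as a non-main cluster and runs them.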
Tuning considerations for scale-across scheduling

NEGOTIATOR_CYCLE_DELAY
   The value on the main and the non-main clusters should be set to similar values to minimize the wait delays that occur when the main cluster requests a negotiator cycle on the non-main clusters. It is reasonable to set NEGOTIATOR_CYCLE_DELAY=1 on all clusters.

MAX_TOP_DOGS
   The maximum number of top-dog scale-across jobs allowed on the main cluster should be smaller than the maximum number of top-dog jobs allowed on the non-main clusters, so that the non-main clusters can schedule both the scale-across and regular jobs as top dogs.

SCALE_ACROSS_SCHEDULING_TIMEOUT
   The default value should be overridden only if there are non-main clusters that have extremely long dispatch cycles or very long NEGOTIATOR_CYCLE_DELAY values. In these cases, SCALE_ACROSS_SCHEDULING_TIMEOUT needs to be set to a value greater than those intervals.

LoadLeveler Blue Gene support

Blue Gene is a massively parallel system based on a scalable cellular architecture which exploits a very large number of tightly interconnected compute nodes (C-nodes).

To take advantage of Blue Gene support, you must be using the LoadLeveler BACKFILL scheduler. With the BACKFILL scheduler, LoadLeveler enables the Blue Gene system to take advantage of reservations that allow you to schedule when, and with which resources, a job will run.

While LoadLeveler Blue Gene support is available on all platforms, Blue Gene®/L™ software is supported only on IBM POWER servers running SLES 9, and Blue Gene®/P™ software is supported only on IBM POWER servers running SLES 10. Mixed clusters of Blue Gene/L and Blue Gene/P systems are not supported.

Terms you should know:
v Compute nodes, also called C-nodes, are system-on-a-chip nodes that execute at most a single job at a time. All the C-nodes are interconnected in a three-dimensional toroidal pattern. Each C-node has a unique address and location in the three-dimensional toroidal space. Compute nodes execute the jobs' tasks and run a minimal custom operating system called BLRTS.
v Front End Nodes (FEN) are machines from which users and administrators interact with Blue Gene. Applications are compiled on and submitted for execution in the Blue Gene core from FENs. User interactions with applications, including debugging, are also performed from the FENs.
v The Service Node is dedicated hardware that runs software to control and manage the Blue Gene system.
v I/O nodes are special nodes that connect the compute nodes to the outside world. I/O nodes allow processes that are executing in the compute nodes to perform I/O operations, such as accessing files, and to communicate with the
job management system. Each I/O node serves anywhere from 8 to 64 C-nodes, depending on the physical configuration.
v mpirun is a program that is executed partly on the Front End Node and partly on the Service Node. mpirun controls and monitors the parallel Blue Gene job. The mpirun program is executed by the user program that is run on the FEN by LoadLeveler.
v A base partition (BP) is a group of compute nodes connected in a 3D rectangular pattern, together with their controlled I/O nodes. A base partition is one of the basic allocation units for jobs. For example, an allocation for a job will require at least one base partition, unless the allocation requests a small partition, in which case sub-base-partition allocation is possible.
v A small partition is a group of C-nodes which are part of one base partition. Valid small partitions have a size of 32 or 128 C-nodes.
v A partition is a group of base partitions, switches, and switch states allocated to a job. A partition is predefined or is created on demand to execute a job. Partitions are physically (electronically) isolated from each other (for example, messages cannot flow outside an allocated partition). A partition can have the topology of a mesh or a torus.
v The Control System is a component that serves as the interface to the Blue Gene system. It contains persistent storage with configuration and status information on the entire system. It also provides various services to perform actions on the Blue Gene system, such as launching a job.
v A node card is a group of 32 compute nodes within a base partition. This is the minimal allocation size for a partition.
v A quarter is a group of 4 node cards. This is a logical grouping of node cards within a base partition. A quarter, which is 128 compute nodes, is the next smallest allowed allocation size for a partition after a node card.
v A switch state is a set of internal switch connections which physically "wire" the partition. A switch has a number of incoming and outgoing wires. An internal switch connection physically connects one incoming wire with one outgoing wire, setting up a communication path between base partitions.

For more information about the Blue Gene system and Blue Gene terminology, refer to IBM System Blue Gene Solution documentation. Table 36 lists the IBM System Blue Gene Solution publications that are available from the IBM Redbooks® Web site at the following URLs:

Table 36. IBM System Blue Gene Solution documentation

Blue Gene/P:
v IBM System Blue Gene Solution: Blue Gene/P System Administration: http://www.redbooks.ibm.com/abstracts/sg247417.html
v IBM System Blue Gene Solution: Blue Gene/P Safety Considerations: http://www.redbooks.ibm.com/abstracts/redp4257.html
v IBM System Blue Gene Solution: Blue Gene/P Application Development: http://www.redbooks.ibm.com/abstracts/sg247287.html
v Evolution of the IBM System Blue Gene Solution: http://www.redbooks.ibm.com/abstracts/redp4247.html
Table 36. IBM System Blue Gene Solution documentation (continued)

Blue Gene/L:
v IBM System Blue Gene Solution: System Administration: http://www.redbooks.ibm.com/abstracts/sg247178.html
v Blue Gene/L: Hardware Overview and Planning: http://www.redbooks.ibm.com/abstracts/sg246796.html
v IBM System Blue Gene Solution: Application Development: http://www.redbooks.ibm.com/abstracts/sg247179.html
v Unfolding the IBM eServer™ Blue Gene Solution: http://www.redbooks.ibm.com/abstracts/sg246686.html

Table 37 lists the Blue Gene subtasks with a pointer to the associated instructions:

Table 37. Blue Gene subtasks and associated instructions
v Configure LoadLeveler Blue Gene support: “Configuring LoadLeveler Blue Gene support”
v Submit and monitor Blue Gene jobs: “Submitting and monitoring Blue Gene jobs” on page 226

Table 38 lists the Blue Gene related topics and associated information:

Table 38. Blue Gene related topics and associated information
v Configuration file: Blue Gene keywords: “Configuration file keyword descriptions” on page 265
v Job command file: Blue Gene keywords: “Job command file keyword descriptions” on page 359
v Commands and APIs: Chapter 16, “Commands,” on page 411, or Chapter 17, “Application programming interfaces (APIs),” on page 541
v Diagnosis and messages: TWS LoadLeveler: Diagnosis and Messages Guide

Configuring LoadLeveler Blue Gene support

Table 39 lists the subtasks for configuring LoadLeveler Blue Gene support along with a pointer to the associated instructions:

Table 39. Blue Gene configuring subtasks and associated instructions
v Configuring LoadLeveler Blue Gene support: “Steps for configuring LoadLeveler Blue Gene support” on page 158
Table 39. Blue Gene configuring subtasks and associated instructions (continued)
v Display information about the Blue Gene system: Use the llstatus command with the -b option to display information about the Blue Gene system. The llstatus command can also be used with the -B option to display information about Blue Gene base partitions, and with the -P option to display information about Blue Gene partitions.
v Display information about Blue Gene jobs:
  – Use the llsummary command with the -l option to display job resource information.
  – Use the llq command with the -b option to display information about all Blue Gene jobs.

Steps for configuring LoadLeveler Blue Gene support

The primary task for configuring LoadLeveler Blue Gene support consists of setting up the environment of the LoadL_negotiator daemon, the environment of any process that will run Blue Gene jobs, and the LoadLeveler configuration file. Perform the following steps to configure LoadLeveler Blue Gene support:
1. Configure the LoadL_negotiator daemon to run on a node which has access to the Blue Gene Control System.
2. Enable Blue Gene support by setting the BG_ENABLED configuration file keyword to true.
3. (Optional) Set any of the following additional Blue Gene related configuration file keywords which your setup requires:
   v BG_ALLOW_LL_JOBS_ONLY
   v BG_CACHE_PARTITIONS
   v BG_MIN_PARTITION_SIZE
   v CM_CHECK_USERID
   See “Configuration file keyword descriptions” on page 265 for more information on these keywords.
4. Set the required environment variables for the LoadL_negotiator daemon and any process that will run Blue Gene jobs. You can use global profiles to set the necessary environment variables for all users. Follow these steps to set environment variables for a LoadLeveler daemon:
   a. Add the required environment variable settings to a global profile.
   b. Set the environment as the administrator before invoking llctl start on the central manager node.
   c. Build a shell script which sets the required environment and starts LoadLeveler, which can be invoked remotely using rsh.
   Note: Using the llctl -h or llctl -g command to start the central manager remotely will not carry the environment variables from the login session to the LoadLeveler daemons on the remote nodes.
   v Specify the full path name of the bridge configuration file by setting the BRIDGE_CONFIG_FILE environment variable. For details on the contents of the bridge configuration file, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg
     For csh:
       setenv BRIDGE_CONFIG_FILE /var/bluegene/config/bridge.cfg
   v Specify the full path name of the file containing the data required to access the Blue Gene Control System database by setting the DB_PROPERTY environment variable. For details on the contents of the database property file, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export DB_PROPERTY=/var/bluegene/config/db.cfg
     For csh:
       setenv DB_PROPERTY /var/bluegene/config/db.cfg
   v Specify the host name of the machine running the Blue Gene control system by setting the MMCS_SERVER_IP environment variable. For details on the use of this environment variable, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export MMCS_SERVER_IP=bluegene.ibm.com
     For csh:
       setenv MMCS_SERVER_IP bluegene.ibm.com

(Note that csh's setenv takes the variable name and value as separate arguments; the original examples used the ksh "name=value" form for csh as well, which csh does not accept.)

Blue Gene reservation support

Reservation supports Blue Gene resources, including the Blue Gene compute nodes. It is important to note that when a reservation includes Blue Gene nodes, it cannot include conventional nodes. A front end node (FEN), which is used to start a Blue Gene job, is not part of the Blue Gene resources. A Blue Gene reservation reserves only Blue Gene resources, and a Blue Gene job step bound to a reservation uses the reserved Blue Gene resources and shares a FEN outside the reservation.

Jobs using Blue Gene resources can be submitted to a Blue Gene reservation to run. A Blue Gene job step can also be used to select what Blue Gene resources to reserve, to make sure the reservation will have enough Blue Gene resources to run the Blue Gene job step.

For more information about reservations, see “Overview of reservations” on page 25.

Blue Gene fair share scheduling support

Fair share scheduling has been extended to Blue Gene resources as well.

The FAIR_SHARE_TOTAL_SHARES keyword in LoadL_config and the fair_shares keyword for the user and group stanza in LoadL_admin apply to both the CPU resources and the Blue Gene resources. When a Blue Gene job step ends, both the CPU utilization and the Blue Gene resource utilization data will be collected. The elapsed job running time multiplied by the number of C-nodes allocated to the job step (the Size Allocated field in the llq -l output) will be counted as the amount of Blue Gene resource used.

The used shares of the Blue Gene resources are independent of the used shares of the CPU resources and are made available through the LoadLeveler variables UserUsedBgShares and GroupUsedBgShares. The LoadLeveler variable JobIsBlueGene indicates whether a job step is a Blue Gene job step or not. LoadLeveler administrators have flexibility
in specifying the behavior of fair share scheduling by using these variables in the SYSPRIO expression. The llfs command and the related APIs can also handle requests related to the Blue Gene resources.

For more information about fair share scheduling, see “Using fair share scheduling.”

Blue Gene heterogeneous memory support

The LoadLeveler job command file has a bg_requirements keyword that can be used to specify the requirements that a Blue Gene base partition must meet to execute the job step.

The Blue Gene compute nodes (C-nodes) in the same base partition have the same amount of physical memory. The C-nodes in different base partitions might have different amounts of physical memory. The bg_requirements job command file keyword allows users to specify the memory requirement on the Blue Gene C-nodes. The bg_requirements keyword works like the requirements keyword, but it can only support memory requirements and applies only to Blue Gene base partitions. For a Blue Gene job step, the requirements keyword value applies to the front end node needed by the job step, and the bg_requirements keyword value applies to the Blue Gene base partitions needed by the job step.

Blue Gene preemption support

Preemption support for Blue Gene jobs has been enabled. Blue Gene jobs have the same preemption support as non-Blue Gene jobs.

In a typical Blue Gene system, many Blue Gene jobs share the same front end node while dedicated Blue Gene resources are used for each job. To avoid preempting Blue Gene jobs that use different Blue Gene resources than those requested by a preempting job, ENOUGH instead of ALL must be used in the PREEMPT_CLASS rules for Blue Gene job preemption. For more information about preemption, see “Preempting and resuming jobs” on page 126.

Blue Gene/L HTC partition support

The allocation of High Throughput Computing (HTC) partitions on Blue Gene/L is supported when the LoadLeveler BG_CACHE_PARTITIONS configuration keyword is set to false.

See the following IBM System Blue Gene Solution Redbooks (dated April 27, 2007) for more information about Blue Gene/L HTC support:
v IBM Blue Gene/L: System Administration, SG24-7178
v IBM Blue Gene/L: Application Development, SG24-7179

Using fair share scheduling

Fair share scheduling in LoadLeveler provides a way to divide resources in a LoadLeveler cluster among users or groups of users.

To fairly share cluster resources, LoadLeveler can be configured to allocate a proportion of the resources to each user or group and to let job priorities be
adjusted based on how much of the resources have been used and when they were used. Generally speaking, LoadLeveler should be configured so that job priorities decrease for a user or group that has recently used more resources than the allocated proportion, and increase for a user or group that has not run any jobs recently.

Administrators can configure the behavior of fair share scheduling through a set of configuration keywords. They can also query fair share information, save a snapshot of historic data, reset and restore fair share scheduling, and perform other functions by using the LoadLeveler llfs command, the GUI, and the corresponding APIs.

Fair share scheduling also includes Blue Gene resources (see “Blue Gene fair share scheduling support” on page 159 for more information).

Note: The time of day clocks on all of the nodes in the cluster must be synchronized in order for fair share scheduling to work properly.

For more information, see the following:
v “llfs - Fair share scheduling queries and operations” on page 450
v Corresponding APIs:
  – “ll_fair_share subroutine” on page 642
  – “Data access API” on page 560
v Keywords:
  – fair_shares
  – FAIR_SHARE_TOTAL_SHARES
  – FAIR_SHARE_INTERVAL
v SYSPRIO expression

Fair share scheduling keywords

The FAIR_SHARE_TOTAL_SHARES global configuration file keyword is used to specify the total number of shares that each type of resource is divided into. The fair_shares keyword in a user or group stanza in the administration file specifies how many shares the user or group is allocated. The ratio of the fair_shares keyword value in a user or group stanza over the FAIR_SHARE_TOTAL_SHARES keyword value defines the resource usage proportion for the user or group. For example, if a user is allocated one third of the cluster resources, then the ratio of the user’s fair_shares value over the FAIR_SHARE_TOTAL_SHARES keyword value should be one third. The LoadLeveler SYSPRIO expression can be configured to let job priorities change to achieve the specified resource usage proportions.

Besides changing job priorities, fair share scheduling does not change in any way how LoadLeveler schedules jobs. If a job can be scheduled to run, it will be run regardless of whether the owner and the LoadLeveler group of the job have any shares allocated or not. No matter how many shares are allocated to a user, if the user does not submit any jobs to run, then the resource usage proportion for that user cannot be achieved and other users might be able to use more than their allocated proportions.

Note: The sum of all allocated shares for users or groups does not have to equal the value of the FAIR_SHARE_TOTAL_SHARES keyword. The share
allocation can be used as a way to prevent a single user from consuming too much of the cluster resources and as a way to share the resources as fairly as possible.

When the value of the FAIR_SHARE_TOTAL_SHARES keyword is greater than 0, fair share scheduling is on, which means that resource usage data is collected when every job ends, regardless of the fair_shares values for any user or group. The collected usage data is converted to used shares for each user and group. The llfs command can be used to display the allocated and used shares. Turning fair share scheduling on does not mean that job priorities are affected by fair share scheduling. You have to configure the SYSPRIO expression to let fair share scheduling affect job priorities in a way that suits your needs. By default, the value of the FAIR_SHARE_TOTAL_SHARES keyword is 0 and fair share scheduling is disabled.

There is a built-in decay mechanism for the historic resource usage data that is collected when jobs end; that is, the initial resource usage value becomes smaller and smaller as time goes by. This decay mechanism allows the most recent resource usage to have more impact on fair share scheduling. The FAIR_SHARE_INTERVAL global configuration file keyword is used to specify how fast the decay is: the shorter the interval, the faster the historic data decays. A resource usage value decays to 5% of its initial value after an elapsed time period of the same length as the FAIR_SHARE_INTERVAL value. Generally, the interval should be at least several times larger than the typical job running time in the cluster to get stable results. A value should be chosen corresponding to how long the historic resource usage data should have an impact on the current job priorities.

The LoadLeveler SYSPRIO expression is used to calculate job priorities. A set of LoadLeveler variables, including some related to fair share scheduling, can be used in the SYSPRIO expression in the global configuration file. You can define the SYSPRIO expression to let fair share scheduling influence the job priorities in a way that is suitable to your needs. For more information, see the SYSPRIO expression in Chapter 12, “Configuration file reference,” on page 263.

When GroupTotalShares, GroupUsedShares, UserTotalShares, UserUsedShares, UserUsedBgShares, GroupUsedBgShares, and JobIsBlueGene and their corresponding user-defined variables are used, you must use the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL global configuration keyword to specify a time interval at which the job priorities will be recalculated using the most recent share usage information.

You can add the following user-defined variables to the LoadL_config global configuration file to make it easier to specify fair share scheduling in the SYSPRIO expressions:
v GroupRemainingShares = (GroupTotalShares - GroupUsedShares)
v GroupHasShares = ($(GroupRemainingShares) > 0)
v GroupSharesExceeded = ($(GroupRemainingShares) <= 0)
v UserRemainingShares = (UserTotalShares - UserUsedShares)
v UserHasShares = ($(UserRemainingShares) > 0)
v UserSharesExceeded = ($(UserRemainingShares) <= 0)
v UserRemainingBgShares = (UserTotalShares - UserUsedBgShares)
v UserHasBgShares = ($(UserRemainingBgShares) > 0)
v UserBgSharesExceeded = ($(UserRemainingBgShares) <= 0)
v GroupRemainingBgShares = (GroupTotalShares - GroupUsedBgShares)
v GroupHasBgShares = ($(GroupRemainingBgShares) > 0)
v GroupBgSharesExceeded = ($(GroupRemainingBgShares) <= 0)
v JobIsNotBlueGene = ! JobIsBlueGene

If fair share scheduling is not turned on, either because the FAIR_SHARE_TOTAL_SHARES keyword value is not positive or because the scheduler type is not BACKFILL, then the variables will have the following values:
   GroupTotalShares: 0
   GroupUsedShares: 0
   $(GroupRemainingShares): 0
   $(GroupHasShares): 0
   $(GroupSharesExceeded): 1
   UserUsedBgShares: 0
   $(UserRemainingBgShares): 0
   $(UserHasBgShares): 0
   $(UserBgSharesExceeded): 1

If a user has the fair_shares keyword set to 10 in its user stanza and the user has used up 8 CPU shares and 3 Blue Gene shares, then the variables will have the following values:
   UserTotalShares: 10
   UserUsedShares: 8
   $(UserRemainingShares): 2
   $(UserHasShares): 1
   $(UserSharesExceeded): 0
   UserUsedBgShares: 3
   $(UserRemainingBgShares): 7
   $(UserHasBgShares): 1
   $(UserBgSharesExceeded): 0

If a group has the fair_shares keyword set to 10 in its group stanza and the group has used up 15 CPU shares and 0 Blue Gene shares, then the variables will have the following values:
   GroupTotalShares: 10
   GroupUsedShares: 15
   $(GroupRemainingShares): -5
   $(GroupHasShares): 0
   $(GroupSharesExceeded): 1
   GroupUsedBgShares: 0
   $(GroupRemainingBgShares): 10
   $(GroupHasBgShares): 1
   $(GroupBgSharesExceeded): 0

The following variables have these values for a Blue Gene job step:
   JobIsBlueGene: 1
   $(JobIsNotBlueGene): 0

The following variables have these values for a non-Blue Gene job step:
   JobIsBlueGene: 0
   $(JobIsNotBlueGene): 1

Reconfiguring fair share scheduling keywords

LoadLeveler configuration and administration files can be modified to assign new values to various keywords. After the files have been modified, issue the llctl -g reconfig command to read in the new keyword values. All new keywords introduced for fair share scheduling become effective right after reconfiguration.
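As a hedged illustration of this workflow (the keyword value shown is arbitrary, not a recommendation), an administrator might edit the global configuration file and then reconfigure every machine in the cluster:

   # In LoadL_config, change the decay interval (illustrative value):
   #    FAIR_SHARE_INTERVAL = 240
   # Then, with the central manager and all Schedd daemons still up,
   # propagate the change:
   llctl -g reconfig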
Reconfiguring when the Schedd daemons are up

To avoid any inconsistency, change the value of the FAIR_SHARE_INTERVAL keyword while the central manager and all Schedd daemons are up, then do the reconfiguration. After the reconfiguration, the following will happen:
v All historic fair share scheduling data will be decayed to the current time using the old value.
v The old value is replaced with the new value.
v The new value will be used from then on.

Note:
1. You must have the same value for the FAIR_SHARE_INTERVAL keyword in the central manager and the Schedd daemons because the FAIR_SHARE_INTERVAL keyword determines the rate of decay for the historic fair share data, and the same value on the daemons maintains data consistency.
2. Some LoadLeveler configuration parameters require restarting LoadLeveler with llctl recycle for changes to take effect. You can use llctl recycle when changing fair share parameters as well. The effect will be the same as using llctl reconfig, because when a Schedd machine shuts down normally, the fair share scheduling data is decayed to the time of the shutdown and saved.

Reconfiguring when the Schedd daemons are down

The value for the FAIR_SHARE_INTERVAL keyword may need to be changed while a Schedd daemon is down. If so, the following will happen when the Schedd daemon is restarted:
v All historic fair share scheduling data will be read in from the disk files in the $(SPOOL) directory with no change.
v When a new job ends, the historic fair share scheduling data for the owner and the LoadLeveler group of the job will be updated using the new value and then sent to the central manager. The new value is used effectively from the time the data was last updated before the Schedd went down, not from the time of the reconfiguration as it would normally be.

Example: three groups share a LoadLeveler cluster

This example, in which three groups share a LoadLeveler cluster, may apply to your situation. For purposes of this example, we will assume the following:
v Three groups of users share a LoadLeveler cluster and each group is to have one third of the resources.
v Historic data will have significant impact for about 10 days.
v Groups with unused shares will have much higher job priorities than groups which have used up their shares.

To set up fair share scheduling with these assumptions, an administrator could update the LoadL_config global configuration file as follows:
FAIR_SHARE_TOTAL_SHARES = 99
FAIR_SHARE_INTERVAL = 240
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300
GroupRemainingShares = ( GroupTotalShares - GroupUsedShares )
GroupHasShares = ( $(GroupRemainingShares) > 0 )
SYSPRIO : 10000000 * $(GroupHasShares) - QDate

In the LoadL_admin admin file, add:

chemistry: type = group
   include_users = harold mark kim enci george charlie
   fair_shares = 33
physics: type = group
   include_users = cnyang gchen newton roy
   fair_shares = 33
math: type = group
   include_users = rich dave chris popco
   fair_shares = 33

When user rich in the math group wants to submit a job, the following keyword can be put into the job command file so that the job will have high priority through the math group:

#@group=math

If user rich has a job that does not need to be run right away or as soon as possible (that is, it can run at any time), then he should run the job in a LoadLeveler group with no shares allocated (for example, the No_Group group). Because the group No_Group has no shares allocated to it in this example, $(GroupHasShares) has a value of 0 and the job priority will be lower than that of jobs whose group has unused shares. The job will run when all higher priority jobs are done, or whenever it can be used to backfill a higher priority job (that is, whenever it can be scheduled).

Example: two thousand students share a LoadLeveler cluster

This example, in which two thousand students share a LoadLeveler cluster, may apply to your situation. For purposes of this example, we will assume the following:
v A university has 2000 students who share a LoadLeveler cluster and every student is to have the same number of shares of the resources.
v Historic data will have significant impact for about 7 days (because FAIR_SHARE_INTERVAL is not specified and the default value is 7 days).
v A student with unused shares is to have somewhat higher job priorities, with priorities decreasing as the number of used shares increases.

The LoadL_config global configuration file should contain the following:
FAIR_SHARE_TOTAL_SHARES = 10000
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 600
UserRemainingShares = ( UserTotalShares - UserUsedShares )
SYSPRIO : 100000 * $(UserRemainingShares) - QDate

In the LoadL_admin admin file, add:

default: type = user
   fair_shares = 5

Note: The value fair_shares = 5 is the result of dividing the total number of shares by the number of students (10000 ÷ 2000). The number of students can be more or less than 2000, but the same configuration parameters still prevent a single user from using too much of the cluster resources in a short time period.

We can see from the SYSPRIO expression that the larger the number of unused shares for a student, and the earlier the job is submitted, the higher the priority of the student’s job.

Querying information about fair share scheduling

The llfs command, the GUI, and the data access API can be used to query information about fair share scheduling.

The llfs command without any options displays the allocated and used shares for all users and LoadLeveler groups that have run one or more jobs in the cluster to completion. The -u and -g options can show the allocated and used shares for any user or LoadLeveler group, regardless of whether they have run any jobs in the cluster. In either case, the user or group need not have any fair_shares allocated in the LoadL_admin administration file for the usage to be reported by the llfs command.

Resetting fair share scheduling

The llfs -r command option (or the GUI option Reset historic data), by default, will start fair share scheduling from the beginning, which means that all the previous historic data will be lost. This command will not run unless all Schedd daemons are up: if a Schedd daemon is down when this command option is issued, the request will not be processed.

To manually reset fair share scheduling, bring down the LoadLeveler cluster, remove all fair share data files (fair_share_queue.dir and fair_share_queue.pag) in the $(SPOOL) directory, and then restart the LoadLeveler cluster.

Saving historic data

The LoadLeveler central manager holds the complete historic fair share data when it is up. Every Schedd holds a portion of the historic fair share data, and the data is stored on disk in the $(SPOOL) directory. When the central manager is restarted, it receives the historic fair share data from every Schedd. If a Schedd machine is down temporarily and the central manager remains up, the data in the central manager is not affected. If a Schedd machine is permanently damaged and
the central manager restarts, the central manager will not be able to get all of the historic fair share data because the data stored on the damaged Schedd is lost. If the value of FAIR_SHARE_INTERVAL is very large, many days of data on the damaged Schedd could be lost.

To reduce the loss of data, the historic fair share data in the central manager can be saved to disk periodically. Recovery can then be done using the latest saved data when a Schedd machine is permanently out of service. The llfs -s command, the GUI, or the ll_fair_share API can be used to save a snapshot of the historic data in the central manager to a file.

Restoring saved historic data

You can use the llfs -r command option, the GUI, or the ll_fair_share API to restore fair share scheduling to a previously saved state. For the file name, specify a file you saved previously using llfs -s.

If the central manager goes down and restarts again, the historic data stored in an out-of-service Schedd machine is not reported to the central manager. If the Schedd machine will not be brought back to service at all, then the administrator can consider restoring fair share scheduling to a state corresponding to the latest saved file.

Procedure for recovering a job spool

The llmovespool command is intended for recovery purposes only. Jobs being managed by a down Schedd are unable to clean up resources or move to completion. These jobs need their job records transferred to another Schedd. The llmovespool command moves the job records from the spool of one managing Schedd to another managing Schedd in the local cluster. All moved jobs retain their original job identifiers.

It is very important that the Schedd that created the job records to be moved is not running during the move operation. Jobs within the job queue database will be unrecoverable if the job queue is updated during the move by any process other than the llmovespool command. The llmovespool command operates on a set of job records; these records are updated as the command executes. When a job is successfully moved, the records for that job are deleted. Job records that are not moved because of a recoverable failure, like the original Schedd not being fenced, may have the llmovespool command executed against them again. It is very important that a Schedd never reads the job records from the spool being moved. Jobs will be unrecoverable if more than one Schedd is considered to be the managing Schedd.

The procedure for recovering a job spool is:
1. Move the files located in the spool directory to be transferred to another directory before entering the llmovespool command, in order to guarantee that no other Schedd process is updating the job records.
2. Add the statement schedd_fenced = true to the machine stanza of the original Schedd node in order to guarantee that the central manager ignores connections from the original managing Schedd, and to prevent conflicts from arising if the original Schedd is restarted after the llmovespool command has been run. See the schedd_fenced keyword in Chapter 13, “Administration file reference,” on page 321 for more information.
3. Reconfigure the central manager node so that it recognizes that the original Schedd is "fenced".
4. Issue the llmovespool command, providing the spool directory where the job records are stored. The command displays a message that the transfer has started and reports status for each job as it is processed.

For more information about the llmovespool command, see “llmovespool - Move job records” on page 472. For more information about the ll_move_spool API, see “ll_move_spool subroutine” on page 683.
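A hedged command-line sketch of the four-step procedure (the machine name node01, the central manager host cmhost, and the spool paths are hypothetical):

   # 1. Preserve the down Schedd's job records under another directory:
   mv /var/LoadL/spool /var/LoadL/spool_to_move
   # 2. In the administration file, fence the original Schedd:
   #       node01: type = machine
   #               schedd_fenced = true
   # 3. Reconfigure the central manager so it ignores the fenced Schedd:
   llctl -h cmhost reconfig
   # 4. Move the job records to the new managing Schedd:
   llmovespool /var/LoadL/spool_to_move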
Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks

Note: This is the last release that will provide the Motif-based graphical user interface xloadl. The function available in xloadl has been frozen since TWS LoadLeveler 3.3.2.

The end user can perform many tasks more efficiently and faster using the graphical user interface (GUI), but there are certain tasks that end users cannot perform unless they have the proper authority. If you are defined as a LoadLeveler administrator in the LoadLeveler configuration file, then you are immediately granted administrative authority and can perform the administrative tasks discussed in this topic. To find out how to grant someone administrative authority, see “Defining LoadLeveler administrators” on page 43.

You can access LoadLeveler administrative commands using the Admin pull-down menu on both the Jobs window and the Machines window of the GUI. The Admin pull-down menu on the Jobs window corresponds to the command options available in the llhold, llfavoruser, and llfavorjob commands. The Admin pull-down menu on the Machines window corresponds to the command options available in the llctl command.

The main window of the GUI has three sub-windows: one for job status with pull-down menus for job-related commands, one for machine status with pull-down menus for machine-related commands, and one for messages and logs (see “The LoadLeveler main window” on page 404 in Chapter 15, “Graphical user interface (GUI) reference,” on page 403). There are a variety of facilities available that allow you to sort and select the items displayed.

Job-related administrative actions

You access the administrative commands that act on jobs through the Admin pull-down menu in the Jobs window of the GUI. You can perform the following tasks with this menu:

Favor Users
   Allows you to favor users. This means that you can select one or more users whose jobs you want to move up in the job queue. This corresponds to the llfavoruser command.
   Select   Admin from the Jobs window
   Select   Favor User
            The Order by User window appears.
   Type in  The name of the user whose jobs you want to favor.
   Press    OK
Unfavor Users
   Allows you to unfavor the jobs of users that you previously favored. This corresponds to the llfavoruser command.
   Select   Admin from the Jobs window
   Select   Unfavor User
            The Order by User window appears.
   Type in  The name of the user whose jobs you want to unfavor.
   Press    OK

Favor Jobs
   Allows you to select a job that you want to favor. This corresponds to the llfavorjob command.
   Select   One or more jobs from the Jobs window
   Select   Admin from the Jobs window
   Select   Favor Job
            The selected jobs are favored.
   Press    OK

Unfavor Jobs
   Allows you to select a job that you want to unfavor. This corresponds to the llfavorjob command.
   Select   One or more jobs from the Jobs window
   Select   Admin from the Jobs window
   Select   Unfavor Job
            Unfavors the jobs that you previously selected.

Syshold
   Allows you to place a system hold on a job. This corresponds to the llhold command.
   Select   A job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Syshold to place a system hold on the job.

Release From Hold
   Allows you to release the system hold on a job. This corresponds to the llhold command.
   Select   A job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Release From Hold to release the system hold on the job.

Preempt
   Available when using the BACKFILL or external schedulers. Preempt allows you to place the selected jobs in preempted state. This action corresponds to the llpreempt command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Preempt

Resume Preempted Job
   Available only when using the BACKFILL or external schedulers. Resume Preempted Job allows you to remove user-initiated preemption (initiated using the Preempt menu option or the llpreempt command) from the selected jobs. This action corresponds to the llpreempt -r command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Resume Preempted Job

Prevent Preempt
   Available only when using the BACKFILL or API scheduler. Prevent Preempt allows you to place the selected running job into a non-preemptable state. When the BACKFILL or API scheduler is in use, this is equivalent to the llmodify -p nopreempt command.
   Select   One job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Prevent Preempt

Allow Preempt
   Available only when using the BACKFILL or API scheduler. Allow Preempt makes the unpreemptable job preemptable again. When the BACKFILL or API scheduler is in use, this is equivalent to the llmodify -p preempt command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Allow Preempt

Extend Wallclock Limits
   Allows you to extend the wallclock limits by the number of minutes specified. This corresponds to the llmodify -W command.
   Select   Admin pull-down window from the Jobs window
   Select   Extend Wallclock Limit
            The Extend Wallclock Limits window appears.
   Type in  The number of minutes to extend the wallclock limit.
   Press    OK

Modify Job Priority
   Allows you to modify the system priority of a job step. This corresponds to the llmodify -s command.
   Select   Admin pull-down window from the Jobs window
   Select   Modify Job Priority
            The Modify Job Priority window appears.
   Type in  An integer value for system priority.
   Press    OK
Move to another cluster
   Allows you to move an idle job from the local cluster to another. This menu item appears only when a multicluster environment is configured. It corresponds to the llmovejob command.
   Select   Admin pull-down window from the Jobs window
   Select   Move to another cluster
            The Move Job to Another Cluster window appears.
   Select   The name of the target cluster.
   Press    OK

Machine-related administrative actions

You access the administrative commands that act on machines using the Admin pull-down menu in the Machines window of the GUI. Using the GUI pull-down menu, you can perform the tasks described in this topic.

Start All
   Starts LoadLeveler on all machines listed in machine stanzas, beginning with the central manager. Submit-only machines are skipped. Use this option when specifying alternate central managers in order to ensure the primary central manager starts before any alternate central manager attempts to serve as central manager.
   Select   Admin from the Machines window.
   Select   Start All

Start LoadLeveler
   Allows you to start LoadLeveler on selected machines.
   Select   One or more machines on which you want to start LoadLeveler.
   Select   Admin from the Machines window.
   Select   Start LoadLeveler

Start Drained
   Allows you to start LoadLeveler with startd drained on selected machines.
   Select   One or more machines on which you want startd drained.
   Select   Admin from the Machines window.
   Select   Start Drained

Stop LoadLeveler
   Allows you to stop LoadLeveler on selected machines.
   Select   One or more machines on which you want to stop LoadLeveler.
   Select   Admin from the Machines window.
   Select   Stop LoadLeveler

Stop All
   Stops LoadLeveler on all machines listed in machine stanzas. Submit-only machines are skipped.
   Select   Admin from the Machines window.
   Select   Stop All
Reconfig
   Forces all daemons to reread the configuration files.
   Select   The machine on which you want to operate. To reconfigure this xloadl session, choose reconfig but do not select a machine.
   Select   Admin from the Machines window.
   Select   reconfig

Recycle
   Stops all LoadLeveler daemons and restarts them.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   recycle

Configuration Tasks
   Starts the Configuration Tasks wizard.
   Select   Admin from the Machines window.
   Select   Config Tasks
   Note: Use the invoking script lltg to start the wizard outside of xloadl. This option will appear on the pull-down only if the LoadL.tguides fileset is installed.

Drain
   Allows no more LoadLeveler jobs to begin running on this machine, but does allow running jobs to complete.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   drain
            A cascading menu allows you to select either daemons, Schedd, startd, or startd by class. If you select daemons, both the startd and the Schedd on the selected machine will be drained. If you select Schedd, only the Schedd on the selected machine will be drained. If you select startd, only the startd on the selected machine will be drained. If you select startd by class, a window appears which allows you to select classes to be drained.

Flush
   Terminates running jobs on this host and sends them back to the system queue to await redispatch. No new jobs are redispatched to this machine until resume is issued. Forces a checkpoint if jobs are enabled for checkpointing.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   flush

Suspend
   Suspends all jobs on this host.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   suspend
Resume
   Resumes all jobs on this machine.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window
   Select   resume
            A cascading menu allows you to select either daemons, Schedd, startd, or startd by class. If you select daemons, both the startd and the Schedd on the selected machine will be resumed. If you select Schedd, only the Schedd on the selected machine will be resumed. If you select startd, only the startd on the selected machine will be resumed. If you select startd by class, a window appears which allows you to select classes to be resumed.

Capture Data
   Collects information on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Capture Data.

Collect Account Data
   Collects accounting data on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Collect Account Data.
            A window appears prompting you to enter the name of the directory in which you want the collected data stored.

Collect Reservation Data
   Collects reservation data on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Collect Reservation Data.
            A window appears prompting you to enter the name of the directory in which you want the collected data stored.

Create Account Report
   Creates an accounting report for you.
   Select   Admin → Create Account Report...
   Note: If you want to receive an extended accounting report, select the extended cascading button.
   A window appears prompting you to enter the following information:
   v A short, long, or extended version of the output. The short version is the default.
   v The user ID
   v The class name
   v The LoadL (LoadLeveler) group name
   v The UNIX group name
   v The Allocated host
   v The job ID
   v The report Type
   v The section
   v A start and end date for the report. If no date is specified, the default is to report all of the data in the report.
   v The name of the input data file.
   v The name of the output data file. This is the same as stdout.
   Press    OK
            The window closes and you return to the main window. The report appears in the Messages window if no output data file was specified.

Move Spool
   Moves the job records from the spool of one managing Schedd to another managing Schedd in the local cluster. This is intended for recovery purposes only.
   Select   One Schedd machine from the Machines window.
   Select   Admin from the Machines window.
   Select   Move Spool
            A window is displayed prompting you to enter the directory containing the job records to be moved.
   Press    OK

Version
   Displays version and release data for LoadLeveler on the machines selected in an information window.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   version

Fair Share Scheduling
   Provides fair share scheduling functions (see “llfs - Fair share scheduling queries and operations” on page 450).
   Select   Admin from the Machines window.
   Select   Fair Share Scheduling
            A cascading menu allows you to select one of the following:
   v Show
     Displays fair share scheduling information for all users or for specified users and groups.
   v Save historic data
     Saves fair share scheduling information into the directory specified.
   v Restore historic data
     Restores fair share scheduling data to a state corresponding to a file previously saved by Save historic data or the llfs -s command.
   v Reset historic data
     Erases all historic CPU data to reset fair share scheduling.
Part 3. Submitting and managing TWS LoadLeveler jobs

After an administrator installs IBM Tivoli Workload Scheduler (TWS) LoadLeveler and customizes the environment, general users can build and submit jobs to exploit the many features of the TWS LoadLeveler runtime environment.
Chapter 8. Building and submitting jobs

Learn more about building and submitting jobs. The topics listed in Table 40 will help you learn about building and submitting jobs:

Table 40. Learning about building and submitting jobs
v Creating and submitting serial and parallel jobs: Chapter 8, “Building and submitting jobs”
v Controlling and monitoring TWS LoadLeveler jobs: Chapter 9, “Managing submitted jobs,” on page 229
v Ways to control or monitor TWS LoadLeveler operations by using the TWS LoadLeveler commands, GUI, and APIs:
  – Chapter 16, “Commands,” on page 411
  – Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235
  – Chapter 11, “Using LoadLeveler’s GUI to build, submit, and manage jobs,” on page 237
  – Chapter 17, “Application programming interfaces (APIs),” on page 541

Table 41 lists the tasks that general users perform to run LoadLeveler jobs.

Table 41. Roadmap of user tasks for building and submitting jobs
v Building jobs:
  – “Building a job command file”
  – “Editing job command files” on page 185
  – “Defining resources for a job step” on page 185
  – “Working with coscheduled job steps” on page 187
  – “Using bulk data transfer” on page 188
  – “Preparing a job for checkpoint/restart” on page 190
  – “Preparing a job for preemption” on page 193
v Submitting jobs:
  – “Submitting a job command file” on page 193
  – “llsubmit - Submit a job” on page 531
v Working with parallel jobs: “Working with parallel jobs” on page 194
v Working with reserved node resources and the jobs that use them: “Working with reservations” on page 213
v Correctly specifying job command file keywords: Chapter 14, “Job command file reference,” on page 357

Building a job command file

Before you can submit a job or perform any other job-related tasks, you need to build a job command file.

A job command file describes the job you want to submit, and can include LoadLeveler keyword statements. For example, to specify a binary to be executed,
you can use the executable keyword, which is described later in this topic. To specify a shell script to be executed, the executable keyword can be used; if it is not used, LoadLeveler assumes that the job command file itself is the executable.

The job command file can include the following:
v LoadLeveler keyword statements: A keyword is a word that can appear in job command files. A keyword statement is a statement that begins with a LoadLeveler keyword. These keywords are described in “Job command file keyword descriptions” on page 359.
v Comment statements: You can use comments to document your job command files. You can add comment lines to the file as you would in a shell script.
v Shell command statements: If you use a shell script as the executable, the job command file can include shell commands.
v LoadLeveler variables: See “Job command file variables” on page 399 for more information.

You can build a job command file either by using the Build a Job window on the GUI or by using a text editor.

Using multiple steps in a job command file

To specify a stream of job steps, you need to list each job step in the job command file. You must specify one queue statement for each job step. Also, the executables for all job steps in the job command file must exist when you submit the job.

For most keywords, if you specify the keyword in a job step of a multi-step job, its value is inherited by all subsequent job steps. Exceptions to this are noted in the keyword description.

LoadLeveler treats all job steps as independent job steps unless you use the dependency keyword. If you use the dependency keyword, LoadLeveler determines whether a job step should run based upon the exit status of the previously run job step. For example, Figure 19 on page 181 contains two separate job steps. Notice that step1 is the first job step to run and that step2 is a job step that runs only if step1 exits with the correct exit status.
# This job command file lists two job steps called "step1"
# and "step2". "step2" only runs if "step1" completes
# with exit status = 0. Each job step requires a new
# queue statement.
#
# @ step_name = step1
# @ executable = executable1
# @ input = step1.in1
# @ output = step1.out1
# @ error = step1.err1
# @ queue
# @ dependency = (step1 == 0)
# @ step_name = step2
# @ executable = executable2
# @ input = step2.in1
# @ output = step2.out1
# @ error = step2.err1
# @ queue

Figure 19. Job command file with multiple steps

In Figure 19, step1 is called the sustaining job step. step2 is called the dependent job step because whether or not it begins to run is dependent upon the exit status of step1. A single sustaining job step can have more than one dependent job step, and a dependent job step can also have job steps dependent upon it.

In Figure 19, each job step has its own executable, input, output, and error statements. Your job steps can have their own separate statements, or they can use those statements defined in a previous job step. For example, in Figure 20, step2 uses the executable statement defined in step1:

# This job command file uses only one executable for
# both job steps.
#
# @ step_name = step1
# @ executable = executable1
# @ input = step1.in1
# @ output = step1.out1
# @ error = step1.err1
# @ queue
# @ dependency = (step1 == 0)
# @ step_name = step2
# @ input = step2.in1
# @ output = step2.out1
# @ error = step2.err1
# @ queue

Figure 20. Job command file with multiple steps and one executable

Examples: Job command files

These examples of job command files may apply to your situation.
v Example 1: Generating multiple jobs with varying outputs
  To run a program several times, varying the initial conditions each time, you could create multiple LoadLeveler scripts, each specifying a different input and output file as described in Figure 22 on page 183. It would probably be more convenient to prepare different input files and submit the job only once, letting LoadLeveler generate the output files and do the multiple submissions for you. Figure 21 on page 182 illustrates the following:
  – You can refer to the LoadLeveler name of your job symbolically, using $(jobid) and $(stepid) in the LoadLeveler script file.
  – $(jobid) refers to the job identifier.
  – $(stepid) refers to the job step identifier and increases after each queue command.
  Therefore, you only need to specify input, output, and error statements once to have LoadLeveler name these files correctly.
  Assume that you created five input files and each input file has different initial conditions for the program. The names of the input files are in the form longjob.in.x, where x is 0–4. Submitting the LoadLeveler script shown in Figure 21 results in your program running five times, each time with a different input file. LoadLeveler generates the output file names from the LoadLeveler job step IDs. This ensures that the results from the different submissions are not merged.

# @ executable = longjob
# @ input = longjob.in.$(stepid)
# @ output = longjob.out.$(jobid).$(stepid)
# @ error = longjob.err.$(jobid).$(stepid)
# @ queue
# @ queue
# @ queue
# @ queue
# @ queue

Figure 21. Job command file with varying input statements

  To submit the job, type the command:
    llsubmit longjob.cmd
  LoadLeveler responds by issuing the following:
    submit: The job "ll6.23" with 5 job steps has been submitted.
  Table 42 lists the standard input files, standard output files, and standard error files for the five job steps:

Table 42. Standard files for the five job steps
  Job Step   Standard Input   Standard Output    Standard Error
  ll6.23.0   longjob.in.0     longjob.out.23.0   longjob.err.23.0
  ll6.23.1   longjob.in.1     longjob.out.23.1   longjob.err.23.1
  ll6.23.2   longjob.in.2     longjob.out.23.2   longjob.err.23.2
  ll6.23.3   longjob.in.3     longjob.out.23.3   longjob.err.23.3
  ll6.23.4   longjob.in.4     longjob.out.23.4   longjob.err.23.4

v Example 2: Using LoadLeveler variables in a job command file
  Figure 22 on page 183 shows how you can use LoadLeveler variables in a job command file to assign different names to input and output files. This example assumes the following:
  – The name of the machine from which the job is submitted is lltest1
  – The user’s home directory is /u/rhclark and the current working directory is /u/rhclark/OSL
  – LoadLeveler assigns a value of 122 to $(jobid).
  In Job Step 0:
  – LoadLeveler creates the subdirectories oslsslv_out and oslsslv_err if they do not exist at the time the job step is started.
  In Job Step 1:
  – The character string ~rhclark denotes the home directory of user rhclark in the input, output, error, and executable statements.
  – The $(base_executable) variable is set to be the "base" portion of the executable, which is oslsslv.
  – The $(host) variable is equivalent to $(hostname). Similarly, $(jobid) and $(stepid) are equivalent to $(cluster) and $(process), respectively.
  In Job Step 2:
  – This job step is executed only if the return codes from Step 0 and Step 1 are both equal to zero.
  – The initial working directory for Step 2 is explicitly specified.

# Job step 0 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.0.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_0_err
#
# @ job_name = OSL
# @ step_name = step_0
# @ executable = oslsslv
# @ arguments = -maxmin=min -scale=yes -alg=dual
# @ environment = OSL_ENV1=20000; OSL_ENV2=500000
# @ requirements = (Arch == "R6000") && (OpSys == "AIX53")
# @ input = test01.mps.$(stepid)
# @ output = $(executable)_out/$(host).$(jobid).$(stepid).out
# @ error = $(executable)_err/$(host)_$(jobid)_$(stepid)_err
# @ queue
#
# Job step 1 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.1.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_1_err
#
# @ step_name = step_1
# @ executable = ~rhclark/$(job_name)/oslsslv
# @ arguments = -maxmin=max -scale=no -alg=primal
# @ environment = OSL_ENV1=60000; OSL_ENV2=500000; OSL_ENV3=70000; OSL_ENV4=800000;
# @ input = ~rhclark/$(job_name)/test01.mps.$(stepid)
# @ output = ~rhclark/$(job_name)/$(base_executable)_out/$(hostname).$(cluster).$(process).out
# @ error = ~rhclark/$(job_name)/$(base_executable)_err/$(hostname)_$(cluster)_$(process)_err
# @ queue
#
# Job step 2 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.2.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_2_err
#
# @ step_name = OSL
# @ dependency = (step_0 == 0) && (step_1 == 0)
# @ comment = oslsslv
# @ initialdir = /u/rhclark/$(step_name)
# @ arguments = -maxmin=min -scale=yes -alg=dual
# @ environment = OSL_ENV1=300000; OSL_ENV2=500000
# @ input = test01.mps.$(stepid)
# @ output = $(comment)_out/$(host).$(jobid).$(stepid).out
# @ error = $(comment)_err/$(host)_$(jobid)_$(stepid)_err
# @ queue

Figure 22. Using LoadLeveler variables in a job command file

v Example 3: Using the job command file as the executable
  The name of the sample script shown in Figure 23 on page 185 is run_spice_job. This script illustrates the following:
  – The script does not contain the executable keyword. When you do not use this keyword, LoadLeveler assumes that the script is the executable. (Since the
  name of the script is run_spice_job, you can add the executable = run_spice_job statement to the script, but it is not necessary.)
  – The job consists of four job steps (there are 4 queue statements). The spice3f5 and spice2g6 programs are invoked at each job step using different input data files:
  - spice3f5: Input for this program is from the file spice3f5_input_x, where x has a value of 0, 1, and 2 for job steps 0, 1, and 2, respectively. The name of this file is passed as the first argument to the script. Standard output and standard error data generated by spice3f5 are directed to the file spice3f5_output_x. The name of this file is passed as the second argument to the script. In job step 3, the names of the input and output files are spice3f5_input_benchmark1 and spice3f5_output_benchmark1, respectively.
  - spice2g6: Input for this program is from the file spice2g6_input_x. Standard output and standard error data generated by spice2g6, together with all other standard output and standard error data generated by this script, are directed to the files spice_test_output_x and spice_test_error_x, respectively. In job step 3, the name of the input file is spice2g6_input_benchmark1. The standard output and standard error files are spice_test_output_benchmark1 and spice_test_error_benchmark1.
  All file names that are not fully qualified are relative to the initial working directory /home/loadl/spice.
  LoadLeveler will send job steps 0 and 1 of this job to a machine that has real memory of 64 MB or more for execution. Job step 2 most likely will be sent to a machine that has more than 128 MB of real memory and has the ESSL library installed, because these preferences have been stated using the LoadLeveler preferences keyword. LoadLeveler will send job step 3 to the machine ll5.pok.ibm.com for execution because of the explicit requirement for this machine in the requirements statement.
#!/bin/ksh
# @ job_name = spice_test
# @ account_no = 99999
# @ class = small
# @ arguments = spice3f5_input_$(stepid) spice3f5_output_$(stepid)
# @ input = spice2g6_input_$(stepid)
# @ output = $(job_name)_output_$(stepid)
# @ error = $(job_name)_error_$(stepid)
# @ initialdir = /home/loadl/spice
# @ requirements = ((Arch == "R6000") && (OpSys == "AIX53") && (Memory > 64))
# @ queue
# @ queue
# @ preferences = ((Memory > 128) && (Feature == "ESSL"))
# @ queue
# @ class = large
# @ arguments = spice3f5_input_benchmark1 spice3f5_output_benchmark1
# @ requirements = (Machine == "ll5.pok.ibm.com")
# @ input = spice2g6_input_benchmark1
# @ output = $(job_name)_output_benchmark1
# @ error = $(job_name)_error_benchmark1
# @ queue

OS_NAME=`uname`
case $OS_NAME in
  AIX)
    echo "Running $OS_NAME version of spice3f5" > $2
    AIX_bin/spice3f5 < $1 >> $2 2>&1
    echo "Running $OS_NAME version of spice2g6"
    AIX_bin/spice2g6
    ;;
  *)
    echo "spice3f5 for $OS_NAME is not available" > $2
    echo "spice2g6 for $OS_NAME is not available"
    ;;
esac

Figure 23. Job command file used as the executable

Editing job command files

After you build a job command file, you can edit it using the editor of your choice. You may want to change the name of the executable or add or delete some statements.

When you create a job command file, it is considered the job executable unless you specify otherwise by using the executable keyword in the job command file. LoadLeveler copies the executable to the spool directory unless the checkpoint keyword was set to yes or interval. Jobs that are to be checkpointed cannot be moved to the spool directory. Do not make any changes to the executable while the job is still in the queue; doing so could affect the way that job runs.

Defining resources for a job step

The LoadLeveler user may use the resources keyword in the job command file to specify the resources to be consumed by each task of a job step. If the resources keyword is specified in the job command file, it overrides any default_resources specified by the administrator for the job step's class.
For example, the following job requests one CPU and one FRM license for each of its tasks:

  resources = ConsumableCpus(1) FRMlicense(1)

If this were specified in a serial job step, one CPU and one FRM license would be consumed while the job step runs. If this were a parallel job step, then the number of CPUs and FRM licenses consumed while the job step runs would depend upon how many tasks were running on each machine. For more information on assigning tasks to nodes, see "Task-assignment considerations" on page 196.

Alternatively, you can use the node_resources keyword in the job command file to specify the resources to be consumed by the job step on each machine it runs on, regardless of the number of tasks assigned to each machine. If the node_resources keyword is specified in the job command file, it overrides the default_node_resources specified by the administrator for the job step's class.

For example, the following job requests 240 MB of ConsumableMemory on each machine:

  node_resources = ConsumableMemory(240 mb)

Even if one machine only runs one task of the job step, while other machines run multiple tasks, 240 MB will be consumed on every machine.

Submitting jobs requesting data staging

The dstg_in_script keyword causes LoadLeveler to generate an inbound data staging step, without requiring the #@queue specification. The value assigned to this keyword is the executable that will be started for data staging, along with any arguments needed by this script or executable.

The dstg_in_wall_clock_limit keyword specifies a wall clock time for the inbound data staging step. Specifying the estimated wall clock limit is mandatory when a data staging script is specified. Similarly, dstg_out_script and dstg_out_wall_clock_limit will be used for generation and execution of the outbound data staging step for the job. All data staging job steps are assigned to the predefined class called data_stage.

Resources required for data staging can be specified using the dstg_resources keyword.

The dstg_node keyword allows you to specify how data replicas must be created:
v If the value specified is any, one data staging task is executed on any available node in the cluster with data staging resources. This value can be used with either the at_submit or the just_in_time configuration options.
v If the value specified is master, one data staging task is executed on the master node. The master node is the machine that will be used to run the inbound and outbound data staging steps as well as the first application step of the job.
v If the value is all, a data staging task is executed on each of the nodes that will be or were used by the first application step.

Any environment variables needed by the data staging scripts can be specified using the dstg_environment keyword. The copy_all value can be assigned to this keyword to get all of the user's environment variables.
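As an illustration only, the following sketch shows how these keywords might fit together in one job command file; the script paths, wall clock limits, and application name are hypothetical, not taken from this manual:

  # Hypothetical sketch of a job with inbound and outbound data staging.
  # The staging script paths and limits below are illustrative.
  # @ job_type = serial
  # @ executable = my_app
  # @ dstg_in_script = /u/rhclark/bin/stage_in.sh /gpfs/project/input
  # @ dstg_in_wall_clock_limit = 00:15:00
  # @ dstg_out_script = /u/rhclark/bin/stage_out.sh /gpfs/project/output
  # @ dstg_out_wall_clock_limit = 00:15:00
  # @ dstg_node = master
  # @ dstg_environment = COPY_ALL
  # @ queue

Note that only the application step has a queue statement; the inbound and outbound data staging steps are generated by LoadLeveler from the dstg_in_script and dstg_out_script keywords and are assigned to the data_stage class.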
For detailed information about the data staging job command file keywords, see "Job command file keyword descriptions" on page 359.

Working with coscheduled job steps

LoadLeveler allows you to specify that a group of two or more steps within a job are to be coscheduled. Coscheduled steps are dispatched at the same time.

Submitting coscheduled job steps

The coschedule = yes keyword in the job command file is used to specify which steps within a job are to be coscheduled.

All steps within a job with the coschedule keyword set to yes will be coscheduled. The coscheduled steps will continue to be stored as individual steps in both memory and in the job queue, but when performing certain operations, such as scheduling, the steps will be managed as a single entity. An operation initiated on one of the coscheduled steps will cause the operation to be performed on all other steps (unless the coscheduling dependency between steps is broken).

Determining priority for coscheduled job steps

Coscheduled steps are supported only with the BACKFILL scheduler. The LoadLeveler BACKFILL scheduler will only dispatch the set of coscheduled steps when enough resource is available for all steps in the set to start. If the set of coscheduled steps cannot be started immediately, but enough resource will be available in the future, then the resource for all the steps will be reserved. In this case, only one of the coscheduled steps will be designated as a top dog, but enough resources will be reserved for all coscheduled steps and all the steps will be dispatched when the top dog step is started.

The coscheduled step with the highest priority in the current job queue will be designated as the primary coscheduled step, and all other steps will be secondary coscheduled steps. The primary coscheduled step will determine when the set of coscheduled steps will be scheduled. The priority for all other coscheduled steps is ignored.

Supporting preemption of coscheduled job steps

Preemption of coscheduled steps is supported.

Preemption of coscheduled steps is supported with the following restrictions:
v In order for a step S to be preemptable by a coscheduled step, all steps in the set of coscheduled steps must be able to preempt step S.
v In order for a step S to preempt a coscheduled step, all steps in the set of coscheduled steps must be preemptable by step S.
v The set of job steps available for preemption will be the same for all coscheduled steps. Any resource made available by preemption for one coscheduled step will be available to all other coscheduled steps.

To determine the preempt type and preempt method to use when a coscheduled step preempts another step, an order of precedence for preempt types and preempt methods has been defined. All steps in the preempting coscheduled step are examined, and the preempt type and preempt method having the highest precedence are used. The order of precedence for preempt type will be ALL and ENOUGH. The precedence order for preempt method is:
v Remove
v Vacate
v System Hold
v User Hold
v Suspend

For more information about preempt types and methods, see "Planning to preempt jobs" on page 128.

When coscheduled steps are running, if one step is preempted as a result of a system-initiated preemption, then all coscheduled steps are preempted.

When determining an optimal preempt set, the BACKFILL scheduler does not consider coscheduled steps as a single entity. All coscheduled steps are in the initial preempt set, but the final preempt set might not include all coscheduled steps, if the scheduler determines the resources of some coscheduled steps are not necessary to start the preempting job step. This implies that more resource than necessary might be preempted when a coscheduled step is in the set of steps to be preempted, because regardless of whether or not all coscheduled steps are in the preempt set, if one coscheduled step is preempted, then all coscheduled steps will be preempted.

Coscheduled job steps and commands and APIs

Commands and APIs that operate on job steps are impacted by coscheduled steps. For the llbind, llcancel, llhold, and llpreempt commands, even if all coscheduled steps are not in the list of targeted steps, the requested operation is performed on all coscheduled steps. For the llmkres and llchres commands, a coscheduled job step cannot be specified when using the -j or -f flags. For the llckpt command, you cannot specify a coscheduled job step using the -u flag.

Termination of coscheduled steps

If a coscheduled step is dispatched but cannot be started and is rejected by the startd daemon or the starter process, then all coscheduled steps are rejected. If a running step is removed or vacated by LoadLeveler as a result of a system related failure, then all coscheduled steps are removed or vacated. If a running step is vacated as a result of the VACATE expression evaluating to true for the step, then all coscheduled steps are vacated.
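To put the coschedule keyword in context, here is a minimal sketch of a job whose two steps are dispatched together; the step names, executables, and node counts are hypothetical:

  # Minimal sketch: both steps set coschedule = yes, so the BACKFILL
  # scheduler starts them only when resources for both are available.
  # @ job_type = parallel
  # @ step_name = compute
  # @ executable = compute_task
  # @ node = 4
  # @ coschedule = yes
  # @ queue
  # @ step_name = monitor
  # @ executable = monitor_task
  # @ node = 1
  # @ coschedule = yes
  # @ queue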
Using bulk data transfer

On systems with device drivers and network adapters that support remote direct-memory access (RDMA), LoadLeveler supports bulk data transfer for jobs that use either the Internet or user space communication protocol mode.

For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or control the use of bulk transfer. For user space jobs that request bulk transfer, however, LoadLeveler creates a consumable RDMA resource requirement. Machines with Switch Network Interface for HPS network adapters are automatically given an RDMA consumable resource with an available amount of four. Machines with InfiniBand switch adapters are given unlimited RDMA consumable resources. Each step that requests bulk transfer consumes one RDMA resource on each machine on which that step runs.

The RDMA resource is similar to user-defined consumable resources except in one important way: a user-specified resource requirement is consumed by every task of the job assigned to a machine, whereas the RDMA resource is consumed once on a machine no matter how many tasks of the job are running on the machine. Other than that exception, LoadLeveler handles the RDMA resource as it does all other consumable resources.

LoadLeveler displays RDMA resources in the output of the following commands:
v llq -l
v llsummary -l

LoadLeveler also displays RDMA resources in the output of the following commands for machines with Switch Network Interface for HPS network adapters:
v llstatus -l
v llstatus -R

Bulk transfer is supported only on systems where the device driver of the network adapters supports RDMA. To determine which systems will support bulk transfer, use the llstatus command with the -l, -R, or -a flag to display machines with adapters that support RDMA. Machines with Switch Network Interface for HPS network adapters will have an RDMA resource listed in the command output of llstatus -l and llstatus -R. The llstatus -a command displays the adapters list, which can be used to verify whether InfiniBand adapters are connected to the machines.

Under certain conditions, LoadLeveler displays a total count of RDMA resources as less than four for machines with Switch Network Interface for HPS network adapters:
v If jobs that LoadLeveler does not manage use RDMA, the amount of available RDMA resource reported to the Negotiator is reduced by the amount consumed by the unmanaged jobs.
v In rare situations, LoadLeveler jobs can fail to release their adapter resources before reporting to the Negotiator that they have completed. When this occurs, the amount of available RDMA reported to the Negotiator is reduced by the amount consumed by the unreleased adapter resources. When the adapter resources are eventually released, the RDMA resource they consumed becomes available again.

These conditions do not require corrective action.

You do not need to perform specific job-definition tasks to enable bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect whether IP communication uses bulk transfer; the implementation of IP where the job runs determines whether bulk transfer is supported.

To enable user space jobs to use bulk data transfer, however, all of the following tasks must be completed. If you omit one or more of these steps, the job will run but will not be able to use bulk transfer.
v A LoadLeveler administrator must update the LoadLeveler configuration file to include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines with Switch Network Interfaces for HPS network adapters. It is not required to include RDMA in the SCHEDULE_BY_RESOURCES list for machines with InfiniBand network adapters. Example:
  SCHEDULE_BY_RESOURCES = RDMA others
v Users must request bulk transfer for their LoadLeveler jobs, using one of the following methods:
  – Specifying the bulkxfer keyword in the LoadLeveler job command file. Example:
    #@ bulkxfer=yes
    If users specify this keyword for jobs that use the IP communication protocol, LoadLeveler ignores the bulkxfer keyword.
  – Specifying a POE command line parameter on interactive jobs. Example:
    poe_job -use_bulk_xfer=yes
  – Specifying an environment variable on interactive jobs. Example:
    export MP_USE_BULK_XFER=yes
    poe_job
v Because LoadLeveler honors the bulk transfer request only for LAPI or MPI jobs, users must ensure that the network keyword in the job command file specifies the MPI, LAPI, or MPI_LAPI protocol for user space communication. Examples:
  network.MPI = sn_single,not_shared,US,HIGH
  network.MPI_LAPI = sn_single,not_shared,US,HIGH
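Combining these tasks, a batch user space job that requests bulk transfer might look like the following sketch; the class name, node counts, and program path are illustrative, not taken from this manual:

  # Illustrative sketch of a user space job requesting bulk data transfer.
  # Assumes the administrator has already added RDMA to the
  # SCHEDULE_BY_RESOURCES list where required.
  # @ job_type = parallel
  # @ class = POE
  # @ node = 2
  # @ tasks_per_node = 2
  # @ network.MPI = sn_single,not_shared,US,HIGH
  # @ bulkxfer = yes
  # @ executable = /usr/bin/poe
  # @ arguments = /u/rhclark/my_mpi_program
  # @ queue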
Preparing a job for checkpoint/restart

You can checkpoint your entire job step, and allow a job step to restart from the last checkpoint.

LoadLeveler has the ability to checkpoint your entire job step, and to allow a job step to restart from the last checkpoint. When a job step is checkpointed, the entire state of each process of that job step is saved by the operating system. On AIX, this checkpoint capability is built into the base operating system. Use the information in Table 43 on page 191 to correctly configure your job for checkpointing.

Table 43. Checkpoint configurations

To specify that: Your job is checkpointable
Do this:
  v Add either one of the following two options to your job command file:
    1. checkpoint = yes
       This enables your job to checkpoint in any of the following ways:
       – The application can initiate the checkpoint. This is only available on AIX.
       – Checkpoint from a program which invokes the ll_ckpt API.
       – Checkpoint using the llckpt command.
       – As the result of a flush command.
    OR
    2. checkpoint = interval
       This enables your job to checkpoint in any of the following ways:
       – The application can initiate the checkpoint. This is only available on AIX.
       – Checkpoint from a program which invokes the ll_ckpt API.
       – Checkpoint using the llckpt command.
       – Checkpoint automatically taken by LoadLeveler.
       – As the result of a flush command.
  v If you would like your job to checkpoint itself, use the API ll_init_ckpt in your serial application, or mpc_init_ckpt for parallel jobs, to cause the checkpoint to occur. This is only available on AIX.

To specify that: Your job step's executable is to be copied to the execute node
Do this:
  Add the ckpt_execute_dir keyword to the job command file.
Table 43. Checkpoint configurations (continued)

To specify that: LoadLeveler automatically checkpoints your job at preset intervals
Do this:
  1. Add the following option to your job command file:
     checkpoint = interval
     This enables your job to checkpoint in any of the following ways:
     v Checkpoint automatically at preset intervals
     v Checkpoint initiated from the user application. This is only available on AIX.
     v Checkpoint from a program which invokes the ll_ckpt API
     v Checkpoint using the llckpt command
     v As the result of a flush command
  2. The system administrators must set the following two keywords in the configuration file to specify how often LoadLeveler should take a checkpoint of the job. These two keywords are:
     MIN_CKPT_INTERVAL = number
       Where number specifies the initial period, in seconds, between checkpoints taken for running jobs.
     MAX_CKPT_INTERVAL = number
       Where number specifies the maximum period, in seconds, between checkpoints taken for running jobs.
     The time between checkpoints will be increased after each checkpoint within these limits as follows:
     v The first checkpoint is taken after a period of time equal to MIN_CKPT_INTERVAL has passed.
     v The second checkpoint is taken after LoadLeveler waits twice as long (MIN_CKPT_INTERVAL x 2).
     v The third checkpoint is taken after LoadLeveler waits twice as long again (MIN_CKPT_INTERVAL x 4).
     LoadLeveler continues to double this period until the value of MAX_CKPT_INTERVAL has been reached, where it stays for the remainder of the job.
     A minimum value of 900 (15 minutes) and a maximum value of 7200 (2 hours) are the defaults. You can set these keyword values globally in the global configuration file so that all machines in the cluster have the same value, or you can specify a different value for each machine by modifying the local configuration files.

To specify that: Your job will not be checkpointed
Do this:
  Add the following option to your job command file:
  checkpoint = no
  This will disable checkpointing.
Table 43. Checkpoint configurations (continued)

To specify that: Your job has successfully checkpointed and terminated. The job has left the LoadLeveler job queue, and you want LoadLeveler to restart your executable from an existing checkpoint file.
Do this:
  1. Add the following option to your job command file:
     restart_from_ckpt = yes
  2. On AIX, specify the name of the checkpoint file by setting the following job command file keywords to specify the directory and file name of the checkpoint file to be used:
     v ckpt_dir
     v ckpt_file
  When the job command file is submitted, a new job will be started that uses the specified checkpoint file to restart the previously checkpointed job.
  The job command file which was used to submit the original job should be used to restart from checkpoint. The only modifications to this file should be the addition of restart_from_ckpt = yes and ensuring that ckpt_dir and ckpt_file point to the appropriate checkpoint file.

To specify that: Your job has successfully checkpointed. The job has been vacated and remains on the LoadLeveler job queue.
Do this:
  When the job restarts, if a checkpoint file is available, the job will be restarted from that file.
  If a checkpoint file is not available upon restart, the job will be started from the beginning.
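For example, a resubmitted job command file for the first case might look like this sketch; the directory and file names are illustrative, not taken from this manual:

  # Illustrative sketch: restart a previously checkpointed serial job.
  # Only restart_from_ckpt was added to the original job command file,
  # and ckpt_dir/ckpt_file point at the existing checkpoint.
  # @ job_type = serial
  # @ executable = longjob
  # @ checkpoint = yes
  # @ ckpt_dir = /u/rhclark/ckpt
  # @ ckpt_file = longjob.ckpt
  # @ restart_from_ckpt = yes
  # @ queue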
Preparing a job for preemption

Depending on various configuration options, LoadLeveler may preempt your job so that a higher priority job step can run. Administrators may:
v Configure LoadLeveler or external schedulers to preempt jobs through various methods.
v Specify preemption rules for job classes.
v Manually preempt your job using LoadLeveler interfaces.

To ensure that your job can be resumed after preemption, set the restart keyword in the job command file to yes.

Submitting a job command file

After building a job command file, you can submit it for processing either to a machine in the LoadLeveler cluster or to one outside of the cluster. See "Querying multiple LoadLeveler clusters" on page 71 for information on submitting a job to a machine outside the cluster.

You can submit a job command file either by using the GUI or the llsubmit command. When you submit a job, LoadLeveler assigns a job identifier and one or more step identifiers.

The LoadLeveler job identifier consists of the following:

machine name
  The name of the machine which assigned the job identifier.
jobid
  A number given to a group of job steps that were initiated from the same job command file.

The LoadLeveler step identifier consists of the following:

job identifier
  The job identifier.
stepid
  A number that is unique for every job step in the job you submit.

If a job command file contains multiple job steps, every job step will have the same jobid and a unique stepid.

For an example of submitting a job, see Chapter 10, "Example: Using commands to build, submit, and manage jobs," on page 235.

In a multicluster environment, job and step identifiers are assigned by the local cluster and are retained by the job regardless of what cluster the job runs in.

Submitting a job using a submit-only machine

You can submit jobs from submit-only machines. Submit-only machines allow machines that do not run LoadLeveler daemons to submit jobs to the cluster. You can submit a job using either the submit-only version of the GUI or the llsubmit command. To install submit-only LoadLeveler, follow the procedure in the TWS LoadLeveler: Installation Guide.

In addition to allowing you to submit jobs, the submit-only feature allows you to cancel and query jobs from a submit-only machine.

Working with parallel jobs

LoadLeveler allows you to schedule parallel batch jobs.

LoadLeveler allows you to schedule parallel batch jobs that have been written using the following:
v On AIX and Linux:
  – IBM Parallel Environment (PE)
  – MPICH, which is an open-source, portable implementation of the Message-Passing Interface Standard developed by Argonne National Laboratory
  – MPICH-GM, which is a port of MPICH on top of Myrinet GM code
v On Linux:
  – MVAPICH, which is a high performance implementation of MPI-1 over InfiniBand, based on MPICH

Support for PE is available in this release of LoadLeveler for Linux.
Step for controlling whether LoadLeveler copies environment variables to all executing nodes

You may specify that LoadLeveler is to copy the environment variables that are specified in the environment job command file statement for a parallel job either to all executing nodes or to only the master executing node.

Before you begin: You need to know:
v Whether Parallel Environment (PE) will be used to run the parallel job; if so, then LoadLeveler does not have to copy the application environment to the executing nodes.
v How to correctly specify the env_copy keyword. For information about keyword syntax and other details, see the env_copy keyword description.

To specify whether LoadLeveler is to copy environment variables to only the master node, or to all executing nodes, use the #@ env_copy keyword in the job command file.

Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software

If support for parallel POE jobs is required, users must be aware that when LoadLeveler uses Parallel Environment for parallel job submission, the PE software requires the same level of PE to be used throughout the parallel job. Different levels of PE cannot be mixed. For example, PE 5.1 supports only LoadLeveler 3.5, and PE 4.3 supports only LoadLeveler 3.4.3. Therefore, a POE parallel job cannot run some of its tasks on LoadLeveler 3.4.3 machines and the remaining tasks on LoadLeveler 3.5 machines.

The requirements keyword of the job command file can be used to ensure that all the tasks of a POE job run on compatible levels of PE and LoadLeveler software in a cluster. Here are three examples showing different ways this can be done:

1. If the following requirements statement is included in the job command file, LoadLeveler's central manager will select only 3.5 or higher machines with the appropriate OpSys level for this job step.
   # @ requirements = (LL_Version >= "3.5") && (OpSys == "AIX53")
2. If a requirements statement such as the following is specified, the tasks of a POE job will see a consistent environment when "hostname1" and "hostname2" run the same levels of PE and LoadLeveler software.
   # @ requirements = (Machine == { "hostname1" "hostname2" }) && (OpSys == "AIX53")
3. If the mixed cluster has been partitioned into 3.4.3 and 3.5 LoadLeveler pools, then you may use a requirements statement similar to one of the two following statements to select machines running the same levels of software.
   v # @ requirements = (Pool == 35) && (OpSys == "AIX53")
   v # @ requirements = (Pool == 343) && (OpSys == "AIX53")
   Here, it is assumed that all the 3.4.3 machines in this mixed cluster are assigned to pool 343 and all 3.5 machines are assigned to pool 35. A LoadLeveler administrator can use the pool_list keyword of the machine stanza of the LoadLeveler administration file to assign machines to pools.

If a statement such as # @ executable = /bin/poe is specified in a job command file, and if the job is intended to be run on 3.5 machines, then it is important that the job be submitted from a 3.5 machine. When the executable keyword is used, LoadLeveler will copy the associated binary on the submitting machine and send it
to a running machine for execution. In this example, the POE program will fail if the submitting and the running machines are at different software levels. In a mixed cluster, this problem can be circumvented by not using the executable keyword in the job command file. By omitting this keyword, the job command file itself is the shell script that will be executed. If this script invokes a local version of the POE binary, then there is no compatibility problem at run time.

Task-assignment considerations

You can use keywords to specify how LoadLeveler assigns tasks to nodes.

You can use the keywords listed in Table 44 to specify how LoadLeveler assigns tasks to nodes. With the exception of unlimited blocking, each of these methods prioritizes machines in an order based on their MACHPRIO expressions. Various task assignment keywords can be used in combination, while others are mutually exclusive.

Table 44. Valid combinations of task assignment keywords are listed in each column

Keyword               1    2    3    4    5
total_tasks           X    X
tasks_per_node                  X    X
node = <min, max>                    X
node = <number>       X         X
task_geometry                             X
blocking                   X

The following examples show how each allocation method works. For each example, consider a 3-node SP with machines named "N1," "N2," and "N3". The machines' order of priority, according to the values of their MACHPRIO expressions, is: N1, N2, N3. N1 has 4 initiators available, N2 has 6, and N3 has 8.

node and total_tasks

When you specify the node keyword with the total_tasks keyword, the assignment function will allocate all of the tasks in the job step evenly among however many nodes you have specified. If the number of total_tasks is not evenly divisible by the number of nodes, then the assignment function will assign any larger groups to the first nodes on the list that can accept them. In this example, 14 tasks must be allocated among 3 nodes:

  # @ node = 3
  # @ total_tasks = 14

Table 45 shows the machine, available initiators, and assigned tasks:

Table 45. node and total_tasks
Machine    Available Initiators    Assigned Tasks
N1         4                       4
N2         6                       5
N3         8                       5

The assignment function divides the 14 tasks into groups of 5, 5, and 4, and begins at the top of the list, to assign the first group of 5. The assignment function starts
at N1, but because there are only 4 available initiators, it cannot assign a block of 5 tasks. Instead, the function moves down the list and assigns the two groups of 5 to N2 and N3; the assignment function then goes back and assigns the group of 4 tasks to N1.

node and tasks_per_node

When you specify the node keyword with the tasks_per_node keyword, the assignment function will assign tasks in groups of the specified value among the specified number of nodes.

  # @ node = 3
  # @ tasks_per_node = 4

blocking

When you specify blocking, tasks are allocated to machines in groups (blocks) of the specified number (the blocking factor). The assignment function will assign one block at a time to the machine which is next in the order of priority until all of the tasks have been assigned. If the total number of tasks is not evenly divisible by the blocking factor, the remainder of tasks is allocated to a single node. The blocking keyword must be specified with the total_tasks keyword. For example:

  # @ blocking = 4
  # @ total_tasks = 17

Here, blocking specifies that a job's tasks will be assigned in blocks, and 4 designates the size of the blocks. Table 46 shows how a blocking factor of 4 would work with 17 tasks:

Table 46. Blocking
Machine    Available Initiators    Assigned Tasks
N1         4                       4
N2         6                       5
N3         8                       8

The assignment function first determines that there will be 4 blocks of 4 tasks, with a remainder of one task. Therefore, the function will allocate the remainder with the first block that it can. N1 gets a block of four tasks, N2 gets a block plus the remainder, then N3 gets a block. The assignment function begins again at the top of the priority list, and N3 is the only node with enough initiators available, so N3 ends up with the last block.

unlimited blocking

When you specify unlimited blocking, the assignment function will allocate as many tasks as possible to each node; the function prioritizes nodes primarily by how many initiators each node has available, and secondarily on their MACHPRIO expressions. This method allows you to allocate tasks among as few nodes as possible. To specify unlimited blocking, specify "unlimited" as the value for the blocking keyword. The total_tasks keyword must also be specified with unlimited blocking. For example:

  # @ blocking = unlimited
  # @ total_tasks = 17

Table 47 on page 198 lists the machine, available initiators, and assigned tasks for unlimited blocking:
Table 47. Unlimited blocking
Machine    Available Initiators    Assigned Tasks
N3         8                       8
N2         6                       6
N1         4                       3

The assignment function begins with N3 (because N3 has the most initiators available) and assigns 8 tasks; N2 takes six, and N1 takes the remaining 3.

task_geometry

The task_geometry keyword allows you to specify which tasks run together on the same machines, although you cannot specify which machines. In this example, the task_geometry keyword groups 7 tasks to run on 3 nodes:

  # @ task_geometry = {(5,2)(1,3)(4,6,0)}

The entire task_geometry expression must be enclosed within braces. The task IDs for each node must be enclosed within parentheses and must be separated by commas. The entire range of task IDs that you specify must begin with zero, and must end with the task ID which is one less than the total number of tasks. You can specify the task IDs in any order, but you cannot skip numbers (the range of task IDs must be complete). Commas may only appear between task IDs, and spaces may only appear between nodes and task IDs.

Submitting jobs that use striping

When communication between parallel tasks occurs only over a single device such as en0, the application and the device are gated by each other. The device must wait for the application to fill a communication buffer before it transmits the buffer, and the application must wait for the device to transmit and empty the buffer before it can refill the buffer. Thus the application and the device must wait for each other, and this wastes time.

The technique of striping refers to using two or more communication paths to implement a single communication path as perceived by the application. As the application sends data, it fills up a buffer on one device. As that buffer is transmitted over the first device, the application's data begins filling up a second buffer and the application perceives no delay in being able to write. When the second buffer is full, it begins transmission over the second device and the application moves on to the next device. When all devices have been used, the application returns to the first device. Much, if not all, of the buffer on the first device has been transmitted while the application wrote to the buffers on the other devices, so the application waits for a minimal amount of time or possibly does not wait at all.

LoadLeveler supports striping in two ways. When multiple switch planes or networks are present, striping over them is indicated by requesting sn_all (multiple networks). If multiple adapters are present on the same network and the communication subsystem, such as LAPI, supports striping over multiple adapters on the same network, specifying the instances keyword on the network statement requests striping over adapters on the same network. The instances keyword specifies the number of adapters on a single network to stripe on. It is possible to stripe over
multiple networks and over multiple adapters on each network by specifying both sn_all and a value for instances greater than one. For HPS adapters, only machines that are connected to both networks are considered for sn_all jobs.

v User space striping: When sn_all is specified on a network statement with US mode, LoadLeveler commits an equivalent set of adapter resources (adapter windows and memory) on each of the networks present in the system to the job on each node where the job runs. The communication subsystem is initialized to indicate that it should use the user space communication protocol on all the available switch adapters to service communication requests on behalf of the application.

v IP striping: When the sn_all device is specified on a network statement with the IP mode, LoadLeveler attempts to locate the striped IP address associated with the switch adapters, known as the multi-link address. If it is successful, it passes the multi-link address to POE for use. If multi-link addresses are not available, LoadLeveler instructs POE to use the IP address of one of the switch adapters. The IP address that is used is different each time a choice has to be made, in an attempt to balance the adapter use. Multi-link addresses must be configured on the system prior to running LoadLeveler, and they are specified with the multilink_address keyword on the switch adapter stanza in the administration file. If a multi-link address is specified for a node, LoadLeveler assigns the multi-link address and multi-link IP name to the striping adapter on that node. If a multi-link address is not present on a node, the sn_all adapter associated with the node will not have an IP address or IP name. If only some of the nodes of a system have multi-link addresses, LoadLeveler will only dispatch jobs that request IP striping to nodes that have multi-link addresses.

  Jobs that request striping (both user space and IP) can be submitted to nodes with only one switch adapter. In that situation, the result is the same as if the job requested no striping.

  Note: When configured, a multi-link address is associated with the virtual ml0 device. The IP address of this device is the multi-link address. The llextRPD program will create a stanza for the ml0 device that will appear similar to Ethernet or token ring adapter stanzas, except that it will include the multilink_list keyword that lists the adapters it performs striping over. As with any other device with an IP address, the ml0 device can be requested in IP mode on the network statement. Doing so would yield a comparable effect to requesting sn_all IP, except that no checking would be performed by LoadLeveler to ensure the associated adapters are actually working. Thus it would be possible to dispatch a job that requested communication over ml0 only to have the job fail because the switch adapters that ml0 stripes over were down.

v Striping over one network: If the instances keyword is specified on a network statement with a value greater than one, LoadLeveler allocates multiple sets of resources for the protocol, using as many sets as the instances keyword specified. For User Space jobs, these sets are adapter windows and memory. For IP jobs, these sets are IP addresses. If multiple adapters exist on each node on the same network, then these sets of adapter resources will be distributed among all the available adapters on the same network.
  Even though LoadLeveler will allocate resources to support striping over a single network, the communication subsystem must be capable of exploiting these resources in order for them to be used.

Understanding striping over multiple networks

Striping over multiple networks involves establishing a communication path using one or more of the available communication networks or switch fabrics.
How those paths are established depends on the network adapter that is present. For the SP Switch2 family of adapters, it is not necessary to acquire communication paths among all tasks on all fabrics as long as there is at least one fabric over which all tasks can communicate. However, each adapter on a machine, if it is available, must use exactly the same adapter resources (window and memory amount) as the other adapters on that machine. Switch Network Interface for HPS adapters are not required to use exactly the same resources on each network, but in order for a machine to be selected, there must be an available communication path on all networks.

Figure 24. Striping over multiple networks. (The figure shows four nodes, each with Adapter A connected to Network A and Adapter B connected to Network B. The connections to Network A on Node 1 and Node 4 are at fault, and the connection to Network B on Node 3 is at fault.)

Consider these sample scenarios using the network configuration as shown in Figure 24, where the adapters are from the SP Switch2 family:
v If a three node job requests striping over networks, it will be dispatched to Node 1, Node 2 and Node 4, where it can communicate on Network B, as long as the adapters on each machine have a common window free and sufficient memory available. It cannot run on Node 3 because that node only has a common communication path with Node 2, namely Network A.
v If a three node job does not request striping, it will not be run because there are not enough adapters connected to Network A to run the job. Notice that the adapter connected to Network A on Node 1 and the adapter connected to Network A on Node 4 are both at fault. SP Switch2 family adapters can only use the adapter connected to Network A for non-striped communication.
v If a three node job requests striped IP and some but not all of the nodes have multi-link addresses, the job will only be dispatched to the nodes that have the multi-link addresses.

Consider these sample scenarios using the network configuration as shown in Figure 24 on page 200, where the adapters are Switch Network Interface for HPS adapters:
v If a three node job requests striping over networks, it will not be dispatched because there are not three nodes that have active connections to both networks.
v If a three node job does not request striping, it can be run on Node 1, Node 2, and Node 4 because they have an active connection to Network B.
v If a three node job requests striped IP and some but not all of the nodes have multi-link addresses, the job will only be dispatched to the nodes that have the multi-link addresses.

Note that for all adapter types, adapters are allocated to a step that requests striping based on what the node knows is the available set of networks or fabrics. LoadLeveler expects each node to have the same knowledge about available networks. If this is not true, it is possible for tasks of a step to be assigned adapters which cannot communicate with tasks on other nodes. Similarly, LoadLeveler expects all adapters that are identified as being on the same network ID or fabric ID to be able to communicate with each other. If this is not true, such as when LoadLeveler operates with multiple, independent sets of networks, other attributes of the step, such as the requirements expression, must be used to ensure that only nodes from a single network set are considered for the step.

As you can see from these scenarios, LoadLeveler will find enough nodes on the same communication path to run the job. If enough nodes connected to a common communication path cannot be found, no communication can take place and the job will not run.

Understanding striping over a single network

Striping over a single network is only supported by Switch Network Interface for HPS adapters. Figure 25 on page 202 shows a network configuration where the adapters support striping over a single network.
Figure 25. Striping over a single network. (The figure shows three nodes whose Adapter A and Adapter B are both connected to Network 0. Concentric ovals labeled instance 0, instance 1, and instance 2 represent the separate communication paths created on the network; on Node 3, the connection from Adapter B is at fault.)

Both Adapter A and Adapter B on a node are connected to Network 0. The entire oval represents the physical network, and the concentric ovals (shaded differently) represent the separate communication paths created for a job by the instances keyword on the network statement.

In this case a three node job requests two instances for communication. On Node 1, Adapter A is used for instance 0 and Adapter B is used for instance 1. There is no requirement to use the same adapter for the same instance, so on Node 2, Adapter B was used for instance 0 and Adapter A for instance 1. On Node 3, where a fault is keeping Adapter B from connecting to the network, Adapter A is used for both instance 0 and instance 1, and Node 3 is available for the job to use.

The network itself does not impose any limitation on the total number of communication paths that can be active at a given time for either a single job or all the jobs using the network. As long as nodes with adapter resources are available, additional communication paths can be created.

Examples: Requesting striping in network statements

You request that a job be run using striping with the network statement in your job command file.

When instances is not specified for a job in the network statement, the default is controlled by the class stanza keyword for sn_all. For more information on the network and max_protocol_instances statements, see the keyword descriptions in "Job command file keyword descriptions" on page 359.

Shown here are examples of IP and user space network modes:

v Example 1: Requesting striping using IP mode
  To submit a job using IP striping, your network statement would look like this:
  network.MPI = sn_all,,IP

v Example 2: Requesting striping using user space mode
  To submit a job using user space striping, your network statement would look like this:
  network.MPI = sn_all,,US

v Example 3: Requesting striping over a single network
  To request IP striping over multiple adapters on a single network, the network statement would look like this:
  network.MPI = sn_single,,IP,,instances=2
  If the nodes on which the job runs have two or more adapters on the same network, two different IP addresses will be allocated to each task for MPI communication. If only one adapter exists per network, the same IP address will be used twice for each task for MPI communication.

v Example 4: Requesting striping over multiple networks and multiple adapters on the same network
  To submit a user space job that will stripe MPI communication over multiple adapters on all networks present in the system, the network statement would look like this:
  network.MPI = sn_all,,US,,instances=2
  If, on a node where the job runs, there are two adapters on each of the two networks, one adapter window would be allocated from each adapter for MPI communication by the job. If only one network were present with two adapters, one adapter window from each of the two adapters would be used. If two networks were present but each only had one adapter on it, two adapter windows from each adapter would be used to satisfy the request for two instances.

Running interactive POE jobs

POE will accept LoadLeveler job command files. However, you can still set the following environment variables to define specific LoadLeveler job attributes before running an interactive POE job:

LOADL_ACCOUNT_NO
  The account number associated with the job.
LOADL_INTERACTIVE_CLASS
  The class to which the job is assigned.
MP_TASK_AFFINITY
  The affinity preferences requested for the job.

For information on other POE environment variables, see IBM Parallel Environment for AIX and Linux: Operation and Use, Volume 1.

For an interactive POE job, LoadLeveler does not start the POE process; therefore, LoadLeveler has no control over the process environment or resource limits.

You also may run interactive POE jobs under a reservation. For additional details about reservations and submitting jobs to run under them, see "Working with reservations" on page 213.

Interactive POE jobs cannot be submitted to a remote cluster.
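For example, an interactive POE session might be prepared from the shell as in this sketch; the account number, class name, affinity value, and program invocation are all illustrative, not taken from this manual:

  # Illustrative shell session for an interactive POE job.
  export LOADL_ACCOUNT_NO=99999          # account number for the job
  export LOADL_INTERACTIVE_CLASS=inter   # class the job is assigned to
  export MP_TASK_AFFINITY=core           # affinity preference (value is illustrative)
  poe ./my_program -procs 4 -euilib us   # run the program interactively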
Running MPICH, MVAPICH, and MPICH-GM jobs

LoadLeveler for AIX and LoadLeveler for Linux support three open-source implementations of the Message-Passing Interface (MPI).

MPICH is an open-source, portable implementation of the MPI Standard developed by Argonne National Laboratory. It contains a complete implementation of version 1.2 of the MPI Standard and also significant parts of MPI-2, particularly in the area of parallel I/O. MPICH, MVAPICH, and MPICH-GM are the three MPI implementations supported by LoadLeveler for AIX and LoadLeveler for Linux:
v Additional documentation for MPICH is available from the Argonne National Laboratory web site at: http://www-unix.mcs.anl.gov/mpi/mpich1/
v MVAPICH is a high performance implementation of MPI-1 over InfiniBand based on MPICH. Additional documentation for MVAPICH is available at the Ohio State University web site at: http://mvapich.cse.ohio-state.edu/
v MPICH-GM is a port of MPICH on top of GM (ch_gm). GM is a low-level message-passing system for Myrinet networks. Additional documentation for MPICH-GM is available from the Myrinet web site at: http://www.myri.com/scs/

For MPICH, MVAPICH, or MPICH-GM, LoadLeveler allocates the machines to run the parallel job and starts the implementation-specific script as the master task. Some of the options of the implementation-specific scripts might not be required or are not supported when used with LoadLeveler.

The following standard mpirun script options are not supported:

-map <list>
  The mpirun script can take either a machinefile or a mapping of the machines on which to run the mpirun job. If both the machinefile and map are specified, then the map list overrides the machinefile. Because we want LoadLeveler to decide which nodes to run on, use the machinefile specified by the environment variable LOADL_HOSTFILE. Specifying a mapping of the host names is not supported.

-allcpus
  This option is only supported when the -machinefile option is used. The mpirun script will run the job using all machines specified in the machine file, without the need to specify the -np option. Without specifying machinefile, the mpirun script will look in the default machines.<arch> file to find the machines on which to run the job. The machines defined in the default file might not match what LoadLeveler has selected, which will cause the job to be removed.

-exclude <list>
  This option is not supported because if you specified a machine in the exclude list that has already been scheduled by LoadLeveler to run the job, the job will be removed.

-dbg
  This option is used to select a debugger to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
-ksq
  This option keeps the send queue. It is useful if you expect later to attach TotalView to the running (or deadlocked) job and want to see the send queues. This option is used for debugging purposes when attaching the mpirun job to TotalView. Because we do not support running debuggers under LoadLeveler MPICH job management, this option is not supported.

-machinedir <directory>
  This option looks for the machine files in the indicated directory. LoadLeveler will create a machinefile that contains the host name for each task in the mpirun job. The environment variable LOADL_HOSTFILE contains the full path to the machinefile. A different machinefile is created per job and stored in the LoadLeveler execute directory. Because there might be multiple jobs running at one time, we do not want the mpirun script to choose any file in the execute directory, because it might not be the correct file that the central manager has assigned to the job step. This option is therefore not supported; use the -machinefile option instead.

v When using MPICH, the mpirun script is run on the first machine allocated to the job. The mpirun script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using llspawn.stdio as RSHCOMMAND. The following option of MPICH's mpirun script is not supported:

  -nolocal
    This option specifies not to run on the local machine. The default behavior of MPICH (p4) is that the first MPI process is always spawned on the machine which mpirun has invoked. The -nolocal option disables the default behavior and does not run the MPI process on the local node. Under LoadLeveler's MPICH job management, it is required that at least one task run on the local node, so the -nolocal option should not be used.

v When using MVAPICH, the mpirun_rsh command is run on the first machine allocated to the job as the master task. The mpirun_rsh command starts the actual execution of parallel tasks on the other nodes included in the LoadLeveler cluster using llspawn as RSHCOMMAND. The following options of MVAPICH's mpirun_rsh command are not supported when used with LoadLeveler:

  -rsh
    Specifies to use rsh for connecting.
  -ssh
    Specifies to use ssh for connecting.
    The -rsh and -ssh options are accepted, but the behavior has been changed to run mpirun_rsh jobs under the LoadLeveler MPICH job manager. Replace the -rsh and -ssh commands with llspawn before compiling mpirun_rsh. Even if you select -rsh or -ssh, the llspawn command is actually used in place of rsh or ssh at runtime.
  -xterm
    Runs remote processes under xterm. This option starts an xterm window for each task in the mpirun job and runs the remote shell with the application inside the xterm window. This will not work under LoadLeveler because the llspawn command replaces the remote shell (rsh or ssh) and llspawn is not kept alive to the end of the application process.
  -debug
    Runs each process under the control of gdb. This option is used to select a debugger to be used with mpirun jobs. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported. This option also requires xterm to be working properly, as it opens gdb under an xterm window. Since we do not support the -xterm option, the -debug option is also not supported.
  h1 h2 ...
    Specifies the names of hosts where processes should run. The mpirun_rsh script can take either a host file or the names of the hosts, h1 h2 and so on, on which to run the mpirun job. If both a host file and a list of machines are specified in the mpirun_rsh arguments, mpirun_rsh will have an error parsing the arguments. Because we want LoadLeveler to decide which nodes to run on, you should use the host list specified by the environment variable LOADL_HOSTFILE. Specifying the names of the hosts is not supported.

v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine allocated to the job as the master task. The mpirun.ch_gm script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using the llspawn command as RSHCOMMAND. The following options of MPICH-GM's mpirun script are not supported when used with LoadLeveler:

  --gm-kill <n>
    This option allows you to kill all remaining processes <n> seconds after the first one dies or exits. Do not specify this option when running the application under LoadLeveler, because LoadLeveler will handle the cleanup of the tasks.
  --gm-tree-spawn
    This option uses a two-level spawn tree to launch the processes, in an effort to reduce the load on any particular host. Because LoadLeveler provides its own scalable method for spawning the application tasks from the master host, using the llspawn command, spawning processes in a tree-like fashion is not supported.
  -totalview
    This option is used to select a TotalView debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
  -r
    This is an optional flag for MPICH-GM which forces the removal of the shared memory files. Because this option is not required, it is not supported. If you specify this option, it will be ignored.
  -ddt
    This option is used to select a DDT debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
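Tying these points together, a batch MPICH job typically lets LoadLeveler generate the machine file and passes it to mpirun through LOADL_HOSTFILE. The following is a sketch along the lines of the shipped samples; it assumes the job_type = MPICH keyword and the LOADL_TOTAL_TASKS environment variable described in the keyword and variable references, and the file names are illustrative:

  #!/bin/sh
  # Illustrative sketch of an MPICH batch job; compare the shipped
  # sample mpich_ivp.cmd. Names and task counts are illustrative.
  # @ job_type = MPICH
  # @ node = 2
  # @ total_tasks = 4
  # @ output = ivp.$(cluster).$(process).out
  # @ error = ivp.$(cluster).$(process).err
  # @ queue
  mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE ./ivp

Because the job command file contains no executable keyword, the file itself is the script that runs as the master task, and the mpirun invocation picks up the machine file that LoadLeveler created for this job step.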
This option also requires xterm to be working properly, because it opens gdb in an xterm window. Because the -xterm option is not supported, the -debug option is also not supported.

h1 h2 ...
Specifies the names of hosts where processes should run. The mpirun_rsh script can take either a host file or the names of the hosts, h1 h2 and so on, on which to run the mpirun job. If both a host file and a list of machines are specified in the mpirun_rsh arguments, mpirun_rsh will fail to parse the arguments. Because LoadLeveler must decide which nodes to run on, use the host list specified by the environment variable LOADL_HOSTFILE. Specifying the names of the hosts is not supported.

v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine allocated to the job as the master task. The mpirun.ch_gm script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using the llspawn command as RSHCOMMAND. The following options of MPICH-GM's mpirun script are not supported when used with LoadLeveler:

--gm-kill <n>
This option kills all remaining processes <n> seconds after the first one dies or exits. Do not specify this option when running the application under LoadLeveler, because LoadLeveler handles the cleanup of the tasks.

--gm-tree-spawn
This option uses a two-level spawn tree to launch the processes, in an effort to reduce the load on any particular host. Because LoadLeveler provides its own scalable method for spawning the application tasks from the master host, using the llspawn command, spawning processes in a tree-like fashion is not supported.

-totalview
This option is used to select a TotalView debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.

-r
This optional MPICH-GM flag forces the removal of the shared memory files. Because this option is not required, it is not supported; if you specify it, it is ignored.

-ddt
This option is used to select a DDT debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.

Sample programs are available:
v See “MPICH sample job command file” on page 208 for a sample MPICH job command file.
v See “MPICH-GM sample job command file” on page 209 for a sample MPICH-GM job command file.
v See “MVAPICH sample job command file” on page 211 for a sample MVAPICH job command file.
v The LoadLeveler samples directory also contains sample files:
– On AIX, use directory /usr/lpp/LoadL/full/samples/llmpich
– On Linux, use directory /opt/ibmll/LoadL/full/samples/llmpich
These sample files include:
– ivp.c: A simple MPI application that you can run as an MPICH, MVAPICH, or MPICH-GM job.
– Job command files to run the ivp.c program as a batch job:
- For MPICH: mpich_ivp.cmd
- For MPICH-GM: mpich_gm_ivp.cmd

Examples: Building parallel job command files

This topic contains sample job command files for the following parallel environments:
v IBM AIX Parallel Operating Environment (POE)
v MPICH
v MPICH-GM
v MVAPICH

POE sample job command file

Figure 26 is a sample job command file for POE.

#
# @ job_type = parallel
# @ environment = COPY_ALL
# @ output = poe.out
# @ error = poe.error
# @ node = 8,10
# @ tasks_per_node = 2
# @ network.LAPI = sn_all,US,,instances=1
# @ network.MPI = sn_all,US,,instances=1
# @ wall_clock_limit = 60
# @ executable = /usr/bin/poe
# @ arguments = /u/richc/My_POE_program -euilib "us"
# @ class = POE
# @ queue

Figure 26. POE job command file – multiple tasks per node

Figure 26 shows the following:
v The total number of nodes requested is a minimum of eight and a maximum of 10 (node=8,10). Two tasks run on each node (tasks_per_node=2). Thus the total number of tasks can range from 16 to 20.
v Each task of the job will run using the LAPI protocol in US mode with a switch adapter (network.LAPI=sn_all,US,,instances=1), and using the MPI protocol in US mode with a switch adapter (network.MPI=sn_all,US,,instances=1).
v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

Figure 27 on page 208 is a second sample job command file for POE.
#
# @ job_type = parallel
# @ input = poe.in.1
# @ output = poe.out.1
# @ error = poe.err
# @ node = 2,8
# @ network.MPI = sn_single,shared,IP
# @ wall_clock_limit = 60
# @ class = POE
# @ queue
/usr/bin/poe /u/richc/my_POE_setup_program -infolevel 2
/usr/bin/poe /u/richc/my_POE_main_program -infolevel 2

Figure 27. POE sample job command file – invoking POE twice

Figure 27 shows the following:
v POE is invoked twice, through my_POE_setup_program and my_POE_main_program.
v The job requests a minimum of two nodes and a maximum of eight nodes (node=2,8).
v The job by default runs one task per node.
v The job uses the MPI protocol with a switch adapter in IP mode (network.MPI=sn_single,shared,IP).
v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

MPICH sample job command file

Figure 28 is a sample job command file for MPICH.

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 28. MPICH job command file - sample 1

Note: You can also specify the job_type=parallel keyword and invoke the mpirun script to run an MPICH job. In that case, the mpirun script would use rsh or ssh and not the llspawn command.

Figure 28 shows that in the following job command file statement:

/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

The following is another example of an MPICH job command file:

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 29. MPICH job command file - sample 2

Figure 29 shows the following:
v The mpirun script is specified as the value of the executable job command file keyword.
v The following mpirun script arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MPICH-GM sample job command file

Figure 30 on page 210 is a sample job command file for MPICH-GM.
#!/bin/ksh
# LoadLeveler JCF file for running an MPICH-GM job
# @ job_type = MPICH
# @ resources = gmports(1)
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_gm_test.$(cluster).$(process).out
# @ error = mpich_gm_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

Figure 30. MPICH-GM job command file - sample 1

Figure 30 shows the following:
v The statement # @ resources = gmports(1) specifies that each task consumes one GM port. This is how LoadLeveler limits the number of GM ports simultaneously in use on any machine. This resource name is the name you specified in schedule_by_resources in the configuration file, and each machine stanza in the administration file must define GM ports and specify the quantity of GM ports available on each machine. Use the llstatus -R command to confirm the names and values of the configured and available consumable resources.
v In the following job command file statement:

/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

/opt/mpich/bin/mpirun.ch_gm
Specifies the location of the mpirun.ch_gm script shipped with the MPICH-GM implementation that runs the MPICH-GM application.
-np Specifies the number of parallel processes.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 31 is another sample job command file for MPICH-GM.

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH-GM job
# @ job_type = MPICH
# @ resources = gmports(1)
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_gm_test.$(cluster).$(process).out
# @ error = mpich_gm_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun.ch_gm
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
# @ queue

Figure 31. MPICH-GM job command file - sample 2

Figure 31 shows the following:
v The mpirun.ch_gm script is specified as the value of the executable job command file keyword.
v The following mpirun.ch_gm script arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MVAPICH sample job command file

Figure 32 is a sample job command file for MVAPICH:

#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 32. MVAPICH job command file - sample 1

Figure 32 shows that in the following job command file statement:

/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 33 on page 212 is another sample job command file for MVAPICH:
#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun_rsh
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 33. MVAPICH job command file - sample 2

Figure 33 shows the following:
v The mpirun_rsh command is specified as the value of the executable job command file keyword.
v The following mpirun_rsh command arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Obtaining status of parallel jobs

Both end users and LoadLeveler administrators can obtain status of parallel jobs in the same way as they obtain status of serial jobs – either by using the llq command or by viewing the Jobs window on the graphical user interface (GUI). By issuing llq -l, or by using the Job Actions → Details selection in xloadl, users get a list of machines allocated to the parallel job. If you also need to see task instance information, use the -x option in addition to the -l option (llq -l -x). See “llq - Query job status” on page 479 for samples of output using the -x and -l options with the llq command.

Obtaining allocated host names

llq -l output includes information on allocated host names. Another way to obtain the allocated host names is with the LOADL_PROCESSOR_LIST environment variable, which you can use from a shell script in your job command file as shown in Figure 34 on page 213. This example uses LOADL_PROCESSOR_LIST to perform a remote copy of a local file to all of the nodes, and then invokes POE. Note that the processor list contains an entry for each task running on a node. If two tasks are running on a node, LOADL_PROCESSOR_LIST will contain two instances of the host name where the tasks are running. The example in Figure 34 on page 213 removes any duplicate entries.
Note that LOADL_PROCESSOR_LIST is set by LoadLeveler, not by the user. This environment variable is limited to 128 host names. If the value is greater than the 128 limit, the environment variable is not set.

#!/bin/ksh
# @ output = my_POE_program.$(cluster).$(process).out
# @ error = my_POE_program.$(cluster).$(process).err
# @ class = POE
# @ job_type = parallel
# @ node = 8,12
# @ network.MPI = sn_single,shared,US
# @ queue
tmp_file="/tmp/node_list"
rm -f $tmp_file
# Copy each entry in the list to a new line in a file so
# that duplicate entries can be removed.
for node in $LOADL_PROCESSOR_LIST
do
echo $node >> $tmp_file
done
# Sort the file removing duplicate entries and save list in variable
nodelist=$(sort -u /tmp/node_list)
for node in $nodelist
do
rcp localfile $node:/home/userid
done
rm -f $tmp_file
/usr/bin/poe /home/userid/my_POE_program

Figure 34. Using LOADL_PROCESSOR_LIST in a shell script

Working with reservations

Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations, which specify a time period during which specific node resources are reserved for use by particular users or groups. Use Table 48 to find information about working with reservations.

Table 48. Roadmap of tasks for reservation owners and users

Learn how reservations work in the LoadLeveler environment:
v “Overview of reservations” on page 25
v “Understanding the reservation life cycle” on page 214

Creating new reservations:
“Creating new reservations” on page 216

Managing jobs that run under a reservation:
v “Submitting jobs to run under a reservation” on page 218
v “Removing bound jobs from the reservation” on page 220

Managing existing reservations:
v “Querying existing reservations” on page 221
v “Modifying existing reservations” on page 221
v “Canceling existing reservations” on page 222
Table 48. Roadmap of tasks for reservation owners and users (continued)

Using the LoadLeveler interfaces for reservations:
v Chapter 16, “Commands,” on page 411
v “Reservation API” on page 643

Understanding the reservation life cycle

From the time at which LoadLeveler creates a reservation through the time the reservation ends or is canceled, a reservation goes through various states, which are indicated in command listings and other displays or output. Understanding these states is important because the current state of a reservation dictates what actions you can take; for example, if you want to modify the start time for a reservation, you may do so only while the reservation is in Waiting state. Table 49 lists the possible reservation states, their abbreviations, and usage notes.

Table 49. Reservation states, abbreviations, and usage notes

Waiting (abbreviation in displays/output: W)
Reservations are in the Waiting state:
1. When LoadLeveler first creates a reservation.
2. After one occurrence of a recurring reservation ends and before the next occurrence starts.
While the reservation is in the Waiting state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
Table 49. Reservation states, abbreviations, and usage notes (continued)

Setup (abbreviation in displays/output: S)
LoadLeveler changes the state of a reservation from Waiting to Setup just before the start time of the reservation. The actual time at which LoadLeveler places the reservation in Setup state depends on the value set for the RESERVATION_SETUP_TIME keyword in the configuration file.
While the reservation is in Setup state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
During this setup period, LoadLeveler:
v Stops scheduling unbound job steps to reserved nodes.
v Preempts any jobs that are still running on the nodes that are reserved through this reservation. To preempt the running jobs, LoadLeveler uses the preemption method specified through the DEFAULT_PREEMPT_METHOD keyword in the configuration file.
Note: The default value for DEFAULT_PREEMPT_METHOD is SU (suspend), which is not supported in all environments, and the default value for PREEMPTION_SUPPORT is NONE. If you want preemption to take place at the start of the reservation, make sure the cluster is configured for preemption (see “Steps for configuring a scheduler to preempt jobs” on page 130 for more information).

Active (abbreviation in displays/output: A)
At the reservation start time, LoadLeveler changes the reservation state from Setup to Active. It also dispatches only job steps that are bound to the reservation, until the reservation completes or is canceled. LoadLeveler does not dispatch bound job steps that:
v Require certain resources, such as floating consumable resources, that are not available during the reservation period.
v Have expected end times that exceed the end time of the reservation. By default, LoadLeveler allows such jobs to run, but their completion is subject to resource availability. (An administrator may configure LoadLeveler to prevent such jobs from running.)
These bound job steps remain idle unless the required resources become available.
While the reservation is in Active state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
Table 49. Reservation states, abbreviations, and usage notes (continued)

Active_Shared (abbreviation in displays/output: AS)
At the reservation start time, LoadLeveler changes the reservation state from Setup to Active. It also dispatches only job steps that are bound to the reservation, unless the reservation was created with the SHARED mode. In this case, if reserved resources are still available after LoadLeveler dispatches any bound job steps that are eligible to run, LoadLeveler changes the reservation state to Active_Shared, and begins dispatching job steps that are not bound to the reservation. Once the reservation state changes to Active_Shared, it remains in that state until the reservation completes or is canceled. During this time, LoadLeveler dispatches both bound and unbound job steps, pending resource availability; bound job steps are considered before unbound job steps. The conditions under which LoadLeveler will not dispatch bound job steps are the same as those listed in the notes for the Active state. The actions that administrators, reservation owners, and users may perform are the same as those listed in the notes for the Active state.

Canceled (abbreviation in displays/output: CA)
When a reservation owner, administrator, or LoadLeveler issues a request to cancel the reservation, LoadLeveler changes the state of a reservation to Canceled and unbinds any job steps bound to this reservation. When the reservation is in this state, no one can modify or submit jobs to this reservation.

Complete (abbreviation in displays/output: C)
When a reservation end time is reached, LoadLeveler changes the state of a reservation to Complete. When the reservation is in this state, no one can modify or submit jobs to this reservation.

Creating new reservations

You must be an authorized user or member of an authorized group to successfully create a reservation. LoadLeveler administrators define authorized users by adding the max_reservations keyword to the user or group stanza in the administration file. The max_reservations keyword setting also defines how many reservations you are allowed to own. Ask your administrator whether you are authorized to create reservations. To be authorized to create reservations, LoadLeveler administrators also must have the max_reservations keyword set in their user or group stanza.

To create a reservation, use the llmkres command. Specify the start time of the reservation using the -t command option and the duration of the reservation using the -d command option. If you are creating a recurring reservation, you must use the -t option to specify the schedule for that reservation.
In addition to the start time and duration (or reservation schedule), you must also use one of the following methods to specify how you want to select nodes for the reservation.

Note: These methods are mutually exclusive.

v The -n option on the llmkres command instructs LoadLeveler to reserve a number of nodes. LoadLeveler may select any unreserved node to satisfy a reservation. This command option is perhaps the easiest to use, because you need to know only how many nodes you want, not specific node characteristics. The minimum number of nodes a reservation must have is 1.
v The -h option on the llmkres command instructs LoadLeveler to reserve specific nodes.
v The -f option on the llmkres command instructs LoadLeveler to submit the specified job command file, and reserve appropriate nodes for the first job step in the job command file. Through this action, all job steps for the job are bound to the reservation. If the reservation request fails, LoadLeveler changes the state for all job steps for this job to NotQueued, and will not schedule any of those job steps to run.
v The -j option on the llmkres command instructs LoadLeveler to reserve appropriate nodes for that job step. Through this action, the job step is bound to the reservation. If the reservation request fails, the job step remains in the same state as it was before.
v The -c option on the llmkres command instructs LoadLeveler to reserve a number of Blue Gene compute nodes (C-nodes). The -j and -f options also reserve Blue Gene resources if the job type is bluegene.

You also may define other reservation attributes, including:
v Whether additional users or groups are allowed to use the reservation. Use the -U or -G command options, respectively.
v Whether the reservation will be in one or both of these optional modes:
– SHARED mode: When you use the -s command option, LoadLeveler allows reserved resources to be shared by job steps that are not associated with a reservation. This mode enables the efficient use of reserved resources; if the bound job steps do not use all of the reserved resources, LoadLeveler can schedule unbound job steps as well so the resources do not remain idle. Unless you specify this mode, however, only job steps bound to the reservation may use the reserved resources.
– REMOVE_ON_IDLE mode: When you use the -i command option, LoadLeveler automatically cancels the reservation when all bound job steps that can run finish running. Using this mode is efficient because it prevents LoadLeveler from wasting reserved resources when no jobs are available to use them. Selecting this mode is especially useful for workloads that will run unattended.
v The default binding method to use when jobs are bound to the reservation. Use the -m option to specify whether the soft or firm binding method should be used when the binding method is not specified by the llbind command.
– Soft binding allows the bound job to use resources outside of the reservation.
– Firm binding restricts the job to the reserved resources.
v For a recurring reservation, when the reservation will expire. Use the -e option to specify the expiration date of the recurring reservation.

Additional rules apply to the use of these options; see “llmkres - Make a reservation” on page 459 for details.
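For example, the following command sketch creates a reservation on four nodes that is shared (-s) and canceled when idle (-i). The start time shown is illustrative, and the duration is assumed here to be given in minutes; see “llmkres - Make a reservation” on page 459 for the exact date, time, and duration formats your installation accepts:

llmkres -t 11/30 10:00 -d 120 -n 4 -s -i

If the command succeeds, LoadLeveler returns a reservation identifier (for example, c94n16.80.r) that you can later pass to commands such as llqres, llchres, and llrmres.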
Alternative: Use the ll_make_reservation and the ll_init_reservation_param subroutines in a program.

Tips:
v If your user ID is not authorized to create any type of reservation but you are a member of a group with authority to create reservations, you must use the -g option to specify the name of the authorized group on the llmkres command.
v Only reservations in waiting and in use are counted toward the limit of allowed reservations set through the max_reservations keyword. LoadLeveler does not count reservations or recurring reservations that have already ended or are in the process of being canceled.
v For accounting purposes, although recurring reservations have multiple instances, a recurring reservation counts as one reservation no matter how many times it may recur during its reservation period.
v Although you may create more than one reservation or recurring reservation for a particular node or set of nodes, only one of those reservations may be active at a time. If LoadLeveler determines that the reservation you are requesting will overlap with another reservation, LoadLeveler fails the create request. No reservation periods for the same set of machines can overlap.

If the create request is successful, LoadLeveler assigns and returns to the owner a unique reservation identifier, in the form host.rid.r, where:
host The name of the machine which assigned the reservation identifier.
rid A number assigned to the reservation by LoadLeveler.
r The letter r is used to distinguish a reservation identifier from a job step identifier.

The following are examples of reservation identifiers:
c94n16.80.r
c94n06.1.r

For details about the LoadLeveler interfaces for creating reservations, see:
v “llmkres - Make a reservation” on page 459.
v “ll_make_reservation subroutine” on page 653 and “ll_init_reservation_param subroutine” on page 652.

Submitting jobs to run under a reservation

LoadLeveler administrators, reservation owners, and authorized users may submit jobs to run under a reservation. You may bind both batch and interactive POE job steps to a reservation, either before a reservation starts or while it is active.

Before you begin:
v If you are a reservation owner and used the -f or -j options on the llmkres command when you created the reservation, you do not have to perform the steps listed in Table 50 on page 219. Those command options automatically bind the job steps to the reservation. To find out whether a particular job step is bound to a reservation, use the command llq -l and check the listing for a reservation ID.
v To find out which reservation IDs you may use, check with your LoadLeveler administrator, or enter the command llqres -l and check the names in the Users or Groups fields (under the Modification time field) in the output listing.
If your user name or a group name to which you belong appears in these output fields, you are authorized to use the reservation.
v LoadLeveler cannot guarantee that certain resources will be available during a reservation period. If you submit job steps that require these resources, LoadLeveler will bind the job steps to the reservation, but will not dispatch them unless the resources become available during the reservation. These resources include:
– Specific nodes that were not reserved under this reservation.
– Floating consumable resources for a cluster.
– Resources that are not released through preemption, such as virtual memory and adapters.
v Whether bound job steps are successfully dispatched depends not only on resource availability, but also on administration file keywords that set maximum numbers, including:
– max_jobs_scheduled
– maxidle
– maxjobs
– maxqueued
If LoadLeveler determines that scheduling a bound job will exceed one or more of these configured limits, your job will remain idle unless conditions permit scheduling at a later time during the reservation period.

Table 50. Instructions for submitting a job to run under a reservation

To bind already submitted jobs:
Use the llbind command.
Alternative: Use the ll_bind_reservation subroutine in a program.
Result: LoadLeveler either sets the reservation ID for each job step that can be bound to the reservation, or sends a failure notification for the bind request.

To bind a new job that has not been submitted:
1. Specify the reservation ID through the LL_RES_ID environment variable or the ll_res_id job command file keyword. The ll_res_id keyword takes precedence over the LL_RES_ID environment variable.
Tip: You can use the ll_res_id keyword to modify the reservation to submit to in a job command file filter.
2. Use the llsubmit command to submit the job.
Result: If the job can be bound to the requested reservation, LoadLeveler sets the reservation ID for each job step that can be bound to the reservation. Otherwise, if the job step cannot be bound to the reservation, LoadLeveler changes the job state to NotQueued. To change the job step’s state to Idle, issue the llbind -r command. Use the llqres command or llq command with the -l option to check the success or failure of the binding request for each job step.

Selecting firm or soft binding: There are two methods by which a job step can be bound to a reservation: firm and soft. When a job step is firm bound to a reservation, the job step can use only the reserved resources. A job step that is soft bound to a reservation can be started before the reservation becomes active and can use nodes that are not part of the reservation. Using soft binding is a way of guaranteeing that resources will be available for the job step at a given time, while allowing the job step to start earlier if there are available resources.
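For example, a minimal job command file that binds itself to a reservation at submission time might look like the following sketch (the reservation identifier and executable path are hypothetical):

# @ job_type = serial
# @ executable = /u/userid/myjob
# @ ll_res_id = c94n16.80.r
# @ queue

Submitting this file with llsubmit binds the job step to reservation c94n16.80.r; you can verify the binding with llq -l. Whether the step then runs with firm or soft binding is determined as described next.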
Which method to use is specified by the -m option of the llbind command. If neither is specified by llbind, the default method specified for the reservation is used. Use llqres -l and review the Binding Method field to determine which method is the default for a reservation.

Binding a job step to a recurring reservation: When a job step is bound to a reservation, the job step can be considered for scheduling as soon as any occurrence of the reservation is active. If you do not want the job step to run right away, but instead you want it to run in a later occurrence of the reservation, you can specify which occurrence the job step will be bound to by adding the occurrence ID to the end of the reservation ID.

The format of the reservation identifier is [host.]rid[.r[.oid]], where:
v host is the name of the machine that assigned the reservation identifier.
v rid is the number assigned to the reservation when it was created. An rid is required.
v r indicates that this is a reservation ID (r is optional if oid is not specified).
v oid is the occurrence ID of a recurring reservation (oid is optional).

When oid is specified, the job step will not be considered for scheduling until that occurrence of the reservation becomes active. The step will remain in Idle state during all earlier occurrences.

If a job step is bound to a recurring reservation, and the reservation occurrence’s end time is reached before the job step can be scheduled to run, the job step will be automatically bound to the next occurrence of the reservation by LoadLeveler. When the next occurrence becomes active, the job step will again be considered for scheduling.

A job can be submitted with the recurring keyword set to yes in the job command file to specify that all steps of the job will be run in every occurrence of the reservation to which it is bound. When all steps of the job have completed, the entire job is requeued and all steps are bound to the next occurrence of the reservation.

For details about the LoadLeveler interfaces for submitting jobs under reservations, see:
v “llbind - Bind job steps to a reservation” on page 415.
v “ll_bind subroutine” on page 645.
v “llsubmit - Submit a job” on page 531.

Removing bound jobs from the reservation

LoadLeveler administrators, reservation owners, and authorized users may use the llbind command to unbind one or more existing jobs from a reservation.

Alternative: Use the ll_bind_reservation subroutine in a program.

Result: LoadLeveler either unbinds the jobs from the reservation, or sends a failure notification for the unbind request. Use the llqres or llq command to check the success or failure of the remove request.
For details about the LoadLeveler interfaces for removing bound jobs from the reservation, see:
v “llbind - Bind job steps to a reservation” on page 415.
v “ll_bind subroutine” on page 645.

Querying existing reservations

Any LoadLeveler administrator or user can issue the llqres and llq commands to query the status of an existing reservation or recurring reservation. Use these commands to request specific information about reservations:
v Various options are available to filter the reservations to be displayed.
v To show details of specific reservations, use the llqres command with the -l option.
v To show job steps that are bound to specific reservations, use the llq command with the -R option.

For details about:
v Reservation attributes and llqres command syntax, see “llqres - Query a reservation” on page 500.
v llq command syntax, see “llq - Query job status” on page 479.

Modifying existing reservations

Only administrators and reservation owners can use the llchres command to modify one or more attributes of a reservation or a recurring reservation. Certain attributes cannot be changed after a reservation has become active. Typical uses for the llchres command include the following:
v Using the command llchres -U +newuser1 newuser2 to allow additional users to submit jobs to the reservation.
v If a reservation was made through the command llmkres -h free but LoadLeveler cannot include a particular node because it is down, you can use the command llchres -h +node to add the node to the reserved node list when that node becomes available again.
v If a reserved node is down after the reservation becomes active, a LoadLeveler administrator can use:
– The command llchres -h -node to remove that node from the reservation.
– The command llchres -h +1 to add another node to the reservation.
v Extending the expiration of a recurring reservation that may be about to expire. You can use llchres -e to specify a new expiration date for the reservation without having to create a new reservation.
v Making a temporary change to the next occurrence of a recurring reservation without affecting any future occurrences of that reservation. For example, you can use the -o option of the llchres command to temporarily add a user (-U) or additional nodes (-n). Once that occurrence ends, the next occurrence will not retain the change.

Alternative: Use the ll_change_reservation subroutine in a program.

For details about the LoadLeveler interfaces for modifying reservations, see:
v “llchres - Change attributes of a reservation” on page 424.
v “ll_change_reservation subroutine” on page 648.
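After changing a reservation, a quick way to confirm the result is to query the reservation and the job steps bound to it, as described above under “Querying existing reservations”. A short sketch (the reservation identifier is hypothetical):

llqres -l
llq -R c94n16.80.r

The first command shows the reservation details, including the Binding Method field; the second lists the job steps bound to the given reservation.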
Canceling existing reservations

Administrators and reservation owners may use the llrmres command to cancel one or more reservations, or to cancel some occurrences of a recurring reservation while leaving the remaining occurrences of that reservation unchanged in the system.

The options available when canceling a reservation are:
v Remove the entire reservation. All occurrences are removed and any bound job steps are automatically unbound from the reservation.
v Remove a specific occurrence of the reservation. All other occurrences remain in the system and all bound job steps remain bound to the reservation.
v Remove all occurrences during a specified interval. For example, a reservation may recur every day for one year, but during a one-week holiday period, the reservation is not needed. The reservation owner could cancel all of the occurrences during that one-week period; all other occurrences would remain in the system and all bound job steps would remain bound to the reservation.

If some occurrences are canceled and the result is that no occurrences remain, then the entire reservation is removed and all jobs are unbound from the reservation.

Alternative: Use the ll_remove_reservation subroutine in a program.

Use the llqres command to check the success or failure of the remove request. Use the llqres -l command to see a list of canceled occurrence IDs or to note individual occurrence start times that have been omitted due to cancellation.

For details about the LoadLeveler interfaces for canceling reservations, see:
v “llrmres - Cancel a reservation” on page 508.
v “ll_remove_reservation subroutine” on page 658.

Submitting jobs requesting scheduling affinity

You can request that a job use scheduling affinity by setting the RSET and TASK_AFFINITY job command file keywords.

Specify RSET with a value of:
v RSET_MCM_AFFINITY to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY.
v user_defined_rset to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_USER_DEFINED; user_defined_rset is the name of a valid user-defined RSet.

Specifying the RSET job command file keyword defaults to requesting memory affinity as a requirement and adapter affinity as a preference. Scheduling affinity options can be customized by using the job command file keyword MCM_AFFINITY_OPTIONS. For more information on these keywords, see “Job command file keyword descriptions” on page 359.

Note: If a job specifies memory or adapter affinity scheduling as a requirement, LoadLeveler will only consider machines where RSET_SUPPORT is set to RSET_MCM_AFFINITY. If there are not enough machines satisfying the memory affinity requirements, the job will stay in the idle state.
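For example, a minimal sketch of a job command file that requests memory-affinity scheduling through the rset keyword, modeled on the POE samples earlier in this chapter (the program path is hypothetical):

# @ job_type = parallel
# @ node = 2
# @ rset = RSET_MCM_AFFINITY
# @ executable = /usr/bin/poe
# @ arguments = /u/userid/my_mpi_program
# @ class = POE
# @ queue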
Specify TASK_AFFINITY with a value of:
v CORE(n) to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT and ST nodes, LoadLeveler will assign n physical CPUs to each job task.
v CPU(n) to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT nodes, LoadLeveler will assign n logical CPUs to each job task. On ST nodes, LoadLeveler will assign n physical CPUs to each job task.

Specify a requirement of SMT with a value of:
v Enabled to have LoadLeveler schedule the job to machines where SMT is currently enabled.
Example: #@ requirements = (SMT == "Enabled")
v Disabled to have LoadLeveler schedule the job to machines where SMT is currently disabled or is not supported.
Example: #@ requirements = (SMT == "Disabled")

OpenMP multithreaded jobs can be submitted requesting thread-level binding, where each individual thread of an OpenMP application is bound to a separate physical core processor or logical CPU. Use the parallel_threads job command file keyword to request OpenMP thread-level binding, optionally along with the task_affinity job command file keyword. The CPUs for the individual OpenMP threads of the tasks are selected based on the number of parallel threads (the parallel_threads job command file keyword) in each task and the set of CPUs or cores assigned (the task_affinity job command file keyword) to the tasks. The CPUs are assigned to the threads only if at least one CPU is available for each thread from the set of CPUs or cores assigned to the task. If the number of CPUs in the set of CPUs or cores assigned to the tasks is not sufficient to bind all of the threads, the job will not run. This example binds 4 OpenMP parallel threads to 4 separate cores:

#@ task_affinity = Core(4)
#@ parallel_threads = 4

Note: If you specify cpus_per_core along with your affinity request as:

#@ task_affinity = core(n)
#@ cpus_per_core = 1

then LoadLeveler allocates the requested number of CPUs to each task on SMT nodes only. The nodes running in ST mode are not assigned for jobs requesting cpus_per_core.

Submitting and monitoring jobs in a LoadLeveler multicluster

Table 51 on page 224 shows the subtasks and associated instructions for submitting and monitoring jobs in a LoadLeveler multicluster:
Table 51. Submitting and monitoring jobs in a LoadLeveler multicluster

Prepare and submit a job in the LoadLeveler multicluster:
“Steps for submitting jobs in a LoadLeveler multicluster environment”

Display information about a job in the LoadLeveler multicluster environment:
v Use the llq -X cluster_name command to display information about jobs on remote clusters.
v Use llq -x -d to display the user’s job command file keyword statements.
v Use llq -X cluster_name -l to obtain multicluster-specific information.

Transfer an idle job from one cluster to another cluster:
Use the llmovejob command, which is described in “llmovejob - Move a single idle job from the local cluster to another cluster” on page 470.

Steps for submitting jobs in a LoadLeveler multicluster environment

In a multicluster environment, you can specify one of the following:
v That a job is to run on a particular cluster.
v That LoadLeveler is to decide which cluster is best from the list of clusters, based on an administrator-defined metric. If any is specified, the job is submitted to the best cluster, based on an administrator-defined metric.
v That a job is a scale-across job, which will run across multiple clusters.

The following procedure explains how to prepare your job to be submitted in the multicluster environment.

Before you begin: You need to know that:
v Only batch jobs are supported in the LoadLeveler multicluster environment. LoadLeveler will fail any interactive jobs that you attempt to submit in a multicluster environment.
v LoadLeveler assigns all steps of a multistep job to the same cluster.
v Job identifiers are assigned by the local cluster and are retained by the job regardless of the cluster in which the job executes.
v Remote jobs are subjected to the same configuration checks as locally submitted jobs. Examples include account validation, class limits, include lists, and exclude lists.

Perform the following steps to submit jobs to run in one cluster in a LoadLeveler multicluster environment.
1. If files used by your job need to be copied between clusters, you must specify the job files to be copied from the local to the remote cluster in the job command file. Use the cluster_input_file and cluster_output_file keywords to specify these files.
Rules:
v Any local file specified for copy must be accessible from the local gateway Schedd machines. Input files must be readable. Directories and permissions must be in place to write output files.
v Any remote file specified for copy must be accessible from the remote gateway Schedd machines. Directories and permissions must be in place to write input files. Output files must be readable when the job terminates.
v To copy more than one file, these keywords can be specified multiple times.
Tip: Each instance of these keywords allows you to specify a single local file and a single remote file. If your job requires copying multiple files (for example, all files in a directory), you may want to use a procedure to consolidate the multiple files into a single file rather than specify multiple cluster_file statements in the job command file. The following is an example of how you could consolidate input files:
a. Use the tar command to produce a single tar file from multiple files.
b. On the cluster_input_file keyword, specify the file that resulted from the tar command processing.
c. Modify your job command file such that it uses the tar command to restore the multiple files from the tar file prior to invoking your application.
2. In the job command file, specify the clusters to which LoadLeveler may submit the job. The cluster_list keyword is a blank-delimited list of cluster names or the reserved word any, where:
v A single cluster name indicates that the job is to be submitted to that cluster.
v A list of multiple cluster names indicates that the job is to be submitted to one of the clusters as determined by the installation exit CLUSTER_METRIC.
v The reserved word any indicates that the job is to be submitted to any cluster defined by the installation exit CLUSTER_METRIC.
Alternative: You can specify the clusters to which LoadLeveler can submit your job on the llsubmit command using the -X option.
3. Use the llsubmit command to submit the job.
Tip: You may use the -X option on the llsubmit command to specify:
-X {cluster_list | any}
Is a blank-delimited list of cluster names or the reserved word any, where:
v A single cluster name indicates that the job is to be submitted to that cluster.
v A list of multiple cluster names indicates that the job is to be submitted to one of the clusters as determined by the installation exit CLUSTER_METRIC.
v The reserved word any indicates that the job is to be submitted to any cluster defined by the installation exit CLUSTER_METRIC.
Note: If a remote job is submitted with a list of clusters or the reserved word any and the installation exit CLUSTER_METRIC is not specified, the remote job is not submitted.

Perform the following steps to submit scale-across jobs to run across multiple clusters in a multicluster environment:
1. In the job command file, specify the cluster_option keyword as scale_across.
Alternative: You can submit a scale-across job using the -S option of the llsubmit command.
2. You can limit which clusters can be used to run the job by using the cluster_list keyword to specify the limited set of clusters. For a scale-across job, if the cluster_list keyword is not specified or the reserved word any is specified in the cluster_list, all clusters may be used to run the job.
Alternative: You can limit which clusters can be used to run the scale-across job using the -X option of the llsubmit command.
3. Use the llsubmit command to submit the job from any cluster in the scale-across multicluster environment.

The llsubmit command displays the assigned local outbound Schedd, the assigned remote inbound Schedd, the scheduling cluster, and the job identifier when the remote job has been successfully submitted. Use the -q flag to stop these additional messages from being displayed.

When you are done, you can use commands to display information about the submitted job; for example:
v Use llq -l -X cluster_name -j job_id, where cluster_name and job_id were displayed by the llsubmit command, to display information about the remote job.
v Use llq -l -X cluster_list to display the long listing about jobs, including scheduling cluster, submitting cluster, user-requested cluster, and cluster input and output files.
v Use llq -X all to display information about all jobs in all configured clusters.
v Use llq twice to display the job status for a scale-across job on all clusters where the job has been distributed. In the first command, specify the -l option to display the set of clusters where the job has been distributed (the value from the Cluster List output line). The second time you run the command, specify the -X option with the list of clusters reported from the first command. The result from that command shows the job status on the other clusters.

Submitting and monitoring Blue Gene jobs

The submission of Blue Gene jobs is similar to the submission of other job types. The following procedure explains how to prepare your job to be submitted to the Blue Gene system.

Before you begin: You need to know that checkpointing Blue Gene jobs is not currently supported.

Tip: Use the llstatus command to check whether Blue Gene support is enabled and whether Blue Gene is currently present. When Blue Gene support is enabled and Blue Gene is currently present, the llstatus command will display:

The BACKFILL scheduler with Blue Gene support is in use
Blue Gene is present
  • 247. v The partition in which the Blue Gene job is run can be specified using the bg_partition job command file keyword. For more information, see the detailed description of the bg_partition keyword. | v The size of a Blue Gene job refers to the number of Blue Gene compute | nodes instead of the number of tasks running on Startd machines. The | following keywords cannot be used to control the size of a Blue Gene job: | – node | – tasks_per_node | – total_tasks 3. Specify any other job command file keywords you require, including the bg_connection and bg_requirements Blue Gene job command file keywords. See “Job command file keyword descriptions” on page 359 for more information on job command file keywords. 4. Upon completing your job command file, submit the job using the llsubmit command. If you experience a problem submitting a Blue Gene job, see “Troubleshooting in a Blue Gene environment” on page 717 for common questions and answers pertaining to operations within a Blue Gene environment. When you are done, you can use the llq -b command to display information about Blue Gene jobs in short form. For more information see “llq - Query job status” on page 479. Example: The following is a sample job command file for a Blue Gene job: # @ job_name = bgsample # @ job_type = bluegene # @ comment = "BGL Job by Size" # @ error = $(job_name).err # @ output = $(job_name).out # @ environment = COPY_ALL; # @ wall_clock_limit = 200:00,200:00 # @ notification = always # @ notify_user = sam # @ bg_size = 1024 # @ bg_connection = torus # @ class = 2bp # @ queue /usr/bin/mpirun -exe /bgscratch/sam/com -verbose 2 -args "-o 100 -b 64 -r" Chapter 8. Building and submitting jobs 227
Chapter 9. Managing submitted jobs

Table 52 lists the tasks and sources of additional information for managing LoadLeveler jobs.

Table 52. Roadmap of user tasks for managing submitted jobs

Displaying information about a submitted job or its environment:
v “Querying the status of a job”
v “Working with machines” on page 230
v “Displaying currently available resources” on page 230
v “llclass - Query class information” on page 433
v “llq - Query job status” on page 479
v “llstatus - Query machine status” on page 512
v “llsummary - Return job resource information for accounting” on page 535

Changing the priority of a submitted job:
v “Setting and changing the priority of a job” on page 230
v “llmodify - Change attributes of a submitted job step” on page 464

Changing the state of a submitted job:
v “Placing and releasing a hold on a job” on page 232
v “Canceling a job” on page 232
v “llhold - Hold or release a submitted job” on page 454
v “llcancel - Cancel a submitted job” on page 421

Checkpointing a submitted job:
v “Checkpointing a job” on page 232
v “llckpt - Checkpoint a running job step” on page 430

Querying the status of a job

Once you submit a job, you can query the status of the job to determine, for example, whether it is still in the queue or running. You also receive other job status related information, such as the job ID and the submitting user ID. You can query the status of a LoadLeveler job either by using the GUI or the llq command. For an example of querying the status of a job, see Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235.

Querying the status of a job using a submit-only machine: In addition to allowing you to submit and cancel jobs, a submit-only machine allows you to query the status of jobs. You can query a job using either the submit-only version of the GUI or the llq command. For information on llq, see “llq - Query job status” on page 479.
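For example, to see the long listing for a single job step, pass its identifier to llq with the -l option (the job step identifier here is hypothetical):

llq -l c94n16.12.0

Issued without arguments, llq summarizes all job steps currently known to the cluster.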
Working with machines

You can perform the following types of tasks related to machines:
v Display machine status
When you submit a job to a machine, the status of the machine automatically appears in the Machines window on the GUI. This window displays machine-related information such as the names of the machines running jobs, as well as the machine’s architecture and operating system. For detailed information on one or more machines in the cluster, you can use the Details option on the Actions pull-down menu. This provides a detailed report that includes information such as the machine’s state and the amount of installed memory. For an example of displaying machine status, see Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235.
v Display the central manager
The LoadLeveler administrator designates one of the machines in the LoadLeveler cluster as the central manager. When jobs are submitted to any machine, the central manager is notified and decides where to schedule the jobs. In addition, it keeps track of the status of machines in the cluster and jobs in the system by communicating with each machine. LoadLeveler uses this information to make the scheduling decisions and to respond to queries. Usually, the system administrator is more concerned about the location of the central manager than the typical end user, but you may also want to determine its location. One reason you might want to locate the central manager is if you want to browse some configuration files that are stored on the same machine as the central manager.
v Display public scheduling machines
Public scheduling machines are machines that participate in the scheduling of LoadLeveler jobs on behalf of users at submit-only machines and users at other workstations that are not running the Schedd daemon. You can find out the names of all these machines in the cluster. Submit-only machines allow machines that are not part of the LoadLeveler cluster to submit jobs to the cluster for processing.

Displaying currently available resources

The LoadLeveler user can get information about currently available resources by using the llstatus command with either the -F or -R option. The -F option displays a list of all of the floating resources associated with the LoadLeveler cluster. The -R option lists all of the consumable resources associated with all of the machines in the LoadLeveler cluster. The user can specify a hostlist with the llstatus command to display only the consumable resources associated with specific hosts.

Setting and changing the priority of a job

LoadLeveler uses the priority of a job to determine its position among a list of all jobs waiting to be dispatched.
Setting and changing the priority of a job

LoadLeveler uses the priority of a job to determine its position among a list of all jobs waiting to be dispatched.

LoadLeveler schedules jobs based on the adjusted system priority, which takes into account both system priority and user priority:

User priority
   Every job has a user priority associated with it. A job with a higher priority runs before a job with a lower priority (when both jobs are owned by the same user). You can set this priority through the user_priority keyword in the job command file, and modify it through the llprio command. See "llprio - Change the user priority of submitted job steps" on page 477 for more information.

System priority
   Every job has a system priority associated with it. Administrators can set this priority in the configuration file using the SYSPRIO keyword expression. The SYSPRIO expression can contain class, group, and user priorities, as shown in the following example:

   SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)

The SYSPRIO expression is evaluated by LoadLeveler to determine the overall system priority of a job. To determine which jobs to run first, LoadLeveler does the following:
1. Assigns a system priority value when the negotiator adds the new job to the queue of jobs eligible for dispatch.
2. Orders jobs first by system priority.
3. Assigns jobs belonging to the same user and the same class an adjusted system priority, which takes all the system priorities and orders them by user priority. Jobs with a higher adjusted system priority are scheduled ahead of jobs with a lower adjusted system priority.

Only administrators may modify the system priority, through the llmodify command with the -s option. See "llmodify - Change attributes of a submitted job step" on page 464 for more information.

Example: How does a job's priority affect dispatching order?

To understand how a job's priority affects dispatching order, consider the sample jobs in Table 53, which lists the priorities assigned to jobs submitted by two users, Rich and Joe. Two of the jobs belong to Joe, and three belong to Rich.

User Joe has two jobs (Joe1 and Joe2) in Class A with SYSPRIOs of 9 and 8, respectively. Since Joe2 has the higher user priority (20), and because both of Joe's jobs are in the same class, Joe2's priority is swapped with that of Joe1 when the adjusted system priority is calculated. This results in Joe2 getting an adjusted system priority of 9, and Joe1 getting an adjusted system priority of 8. Similarly, the Class A jobs belonging to Rich (Rich1 and Rich3) also have their priorities swapped. The priority of the job Rich2 does not change, since this job is in a different class (Class B).

Table 53. How LoadLeveler handles job priorities

Job     User Priority   System Priority (SYSPRIO)   Class   Adjusted System Priority
Rich1   50              10                          A       6

Chapter 9. Managing submitted jobs 231
Table 53. How LoadLeveler handles job priorities (continued)

Job     User Priority   System Priority (SYSPRIO)   Class   Adjusted System Priority
Joe1    10              9                           A       8
Joe2    20              8                           A       9
Rich2   100             7                           B       7
Rich3   90              6                           A       10

Placing and releasing a hold on a job

You may place a hold on a job and thereby cause the job to remain in the queue until you release it.

There are two types of holds: a user hold and a system hold. Both you and your LoadLeveler administrator can place and release a user hold on a job. Only a LoadLeveler administrator, however, can place and release a system hold on a job.

You can place a hold on a job or release the hold either by using the GUI or the llhold command. For examples of holding and releasing jobs, see Chapter 10, "Example: Using commands to build, submit, and manage jobs," on page 235.

As a user or an administrator, you can also use the startdate keyword to place a hold on a job. This keyword allows you to specify when you want the job to run.

Canceling a job

You can cancel one of your jobs that is either running or waiting to run by using either the GUI or the llcancel command. You can use llcancel to cancel LoadLeveler jobs, including jobs from a submit-only machine. For more information about the llcancel command, see "llcancel - Cancel a submitted job" on page 421.

Checkpointing a job

Checkpointing is a method of periodically saving the state of a job so that, if for some reason the job does not complete, it can be restarted from the saved state.

Checkpoints can be taken either under the control of the user application or external to the application. On AIX only, the LoadLeveler API ll_init_ckpt is used to initiate a serial checkpoint from the user application. For initiating checkpoints from within a parallel application, the API mpc_init_ckpt should be used. These APIs allow the writer of the application to determine at what points in the application it would be appropriate to save the state of the job. To enable parallel applications to initiate checkpointing, you must use the APIs provided with the Parallel Environment (PE) program. For information on parallel checkpointing, see IBM Parallel Environment for AIX and Linux: Operation and Use, Volume 1.

It is also possible to checkpoint a program running under LoadLeveler outside the control of the application. There are several ways to do this:
v Use the llckpt command to initiate a checkpoint for a specific job step. See "llckpt - Checkpoint a running job step" on page 430 for more information.

232 TWS LoadLeveler: Using and Administering
v Checkpoint from a program that invokes the ll_ckpt API to initiate a checkpoint of a specific job step. See "ll_ckpt subroutine" on page 550 for more information.
v Have LoadLeveler automatically checkpoint all running jobs that have been enabled for checkpoint. To enable this automatic checkpoint, specify checkpoint = interval in the job command file.
v As the result of an llctl flush command.

Note: For interactive parallel jobs, the environment variable CHECKPOINT must be set to yes in the environment prior to starting the parallel application, or the job will not be enabled for checkpoint. For more information, see IBM Parallel Environment for AIX and Linux: MPI Programming Guide.

Chapter 9. Managing submitted jobs 233
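As a minimal sketch of enabling automatic checkpointing from the job command file (the job and directory names are hypothetical, and the ckpt_dir keyword shown corresponds to the Ckpt Directory field described in Table 60 on page 248; see "Job command file keyword descriptions" on page 359 for the authoritative keyword names and defaults):

   # @ job_name   = longjob
   # @ executable = longjob
   # @ checkpoint = interval
   # @ ckpt_dir   = /tmp/ckpt
   # @ queue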
Chapter 10. Example: Using commands to build, submit, and manage jobs

The following procedure presents a series of simple tasks that a user might perform using commands. For additional information about individual commands noted in the procedure, see Chapter 16, "Commands," on page 411.

1. Build your job command file by using a text editor to create a script file. Into the file enter the name of the executable, other keywords designating such things as output locations for messages, and the necessary LoadLeveler statements, as shown in Figure 35:

   # This job command file is called longjob.cmd. The
   # executable is called longjob, the input file is longjob.in,
   # the output file is longjob.out, and the error file is
   # longjob.err.
   #
   # @ executable = longjob
   # @ input = longjob.in
   # @ output = longjob.out
   # @ error = longjob.err
   # @ queue

   Figure 35. Building a job command file

2. You can optionally edit the job command file you created in step 1.
3. To submit the job command file that you created in step 1, use the llsubmit command:
   llsubmit longjob.cmd
   LoadLeveler responds by issuing a message similar to:
   submit: The job "wizard.22" has been submitted.
   where wizard is the name of the machine to which the job was submitted and 22 is the job identifier (ID). You may want to record the identifier for future use (although you can obtain this information later if necessary).
4. To display the status of the job you just submitted, use the llq command. This command returns information about all jobs in the LoadLeveler queue:
   llq wizard.22
   where wizard is the machine name to which you submitted the job, and 22 is the job ID. You can also query this job using the command llq wizard.22.0, where 0 is the step ID.
5. To change the priority of a job, use the llprio command. To increase the priority of the job you submitted by a value of 10, enter:
   llprio +10 wizard.22.0
   You can change the user priority of a job that is in the queue or one that is running. This only affects jobs belonging to the same user and the same class. If you change the priority of a job in the queue, the job's priority increases or decreases in relation to your other jobs in the queue. If you change the priority of a job that is running, it does not affect the job while it is running. It only

235
affects the job if the job re-enters the queue to be dispatched again. For more information, see "Setting and changing the priority of a job" on page 230.
6. To place a temporary hold on a job in a queue, use the llhold command. This command only takes effect if jobs are in the Idle or NotQueued state. To place a hold on wizard.22.0, enter:
   llhold wizard.22.0
7. To release the hold you placed in step 6, use the llhold command:
   llhold -r wizard.22.0
8. To display the status of the machine to which you submitted a job, use the llstatus command:
   llstatus -l wizard
9. To cancel wizard.22.0, use the llcancel command:
   llcancel wizard.22.0

236 TWS LoadLeveler: Using and Administering
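Building on this example, a job command file can contain more than one step. The following hedged sketch uses the step_name and dependency keywords (described in "Job command file keyword descriptions" on page 359) so that a hypothetical postprocess step runs only if the first step exits with a return code of 0:

   # Two-step variant of longjob.cmd (illustrative sketch, not verbatim product output)
   # @ step_name  = step1
   # @ executable = longjob
   # @ queue
   # @ step_name  = step2
   # @ executable = postprocess
   # @ dependency = (step1 == 0)
   # @ queue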
  • 257. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs | Note: This is the last release that will provide the Motif-based graphical user | interface xloadl. The function available in xloadl has been frozen since TWS | LoadLeveler 3.3.2. You do not have to perform the tasks in the order listed. You may perform certain tasks before others without any difficulty; however, some tasks must be performed prior to others for succeeding tasks to work. For example, you cannot submit a job if you do not have a job command file that you built using either the GUI or an editor. The tasks included in this topic are listed in Table 54. Table 54. User tasks available through the GUI Subtask Associated information (see...) Building and submitting v “Building jobs” jobs v “Editing the job command file” on page 249 v “Submitting a job command file” on page 250 Obtaining job status v “Displaying and refreshing job status” on page 251 v “Specifying which jobs appear in the Jobs window” on page 258 v “Sorting the Jobs window” on page 252 Managing a submitted job v “Changing the priority of your jobs” on page 253 v “Placing a job on hold” on page 253 v “Releasing the hold on a job” on page 253 v “Canceling a job” on page 254 Working with machines v “Displaying and refreshing machine status” on page 255 v “Specifying which machines appear in Machines window” on page 259 v “Sorting the Machines window” on page 257 v “Finding the location of the central manager” on page 257 v “Finding the location of the public scheduling machines” on page 258 Saving LoadLeveler “Saving LoadLeveler messages in a file” on page 259 messages in a file Building jobs Use these instructions when building jobs. From the Jobs window: SELECT File → Build a Job The dialog box shown in Figure 36 on page 238 appears: 237
  • 258. Figure 36. LoadLeveler build a job window Complete those fields for which you want to override what is currently specified in your skel.cmd defaults file. Sample skel.cmd and mcluster_skel.cmd files are found in the samples subdirectory of the 238 TWS LoadLeveler: Using and Administering
release directory. You can update this file to define defaults for your site, and then update the *skelfile resource in Xloadl to point to your new skel.cmd file. If you want a personal defaults file, copy skel.cmd to one of your directories, edit the file, and update the *skelfile resource in .Xdefaults.

Table 55 shows the fields displayed in the Build a Job window:

Table 55. GUI fields and input

Executable
   Name of the program to run. It must be an executable file. Optional. If omitted, the command file is executed as if it were a shell script.
Arguments
   Parameters to pass to the program. Required only if the executable requires them.
Stdin
   Filename to use as standard input (stdin) by the program. Optional. The default is /dev/null.
Stdout
   Filename to use as standard output (stdout) by the program. Optional. The default is /dev/null.
Stderr
   Filename to use as standard error (stderr) by the program. Optional. The default is /dev/null.
Cluster Input File
   A comma-delimited local and remote path name pair, representing the local file to copy to the remote location. If you have more than one pair to enter, the More button will display a Cluster Input Files input window. Optional. The default is that no files are copied.
Cluster Output File
   A comma-delimited local and remote path name pair, representing the local file destination to copy the remote file into. If you have more than one pair to enter, the More button will display a Cluster Output Files input window. Optional. The default is that no files are copied.
Initialdir
   Initial directory. LoadLeveler changes to this directory before running the job. Optional. The default is your current working directory.
Notify User
   User ID of the person to notify regarding the status of the submitted job. Optional. The default is your user ID.
StartDate
   Month, day, and year in the format mm/dd/yyyy. The job will not start before this date. Optional. The default is to run the job as soon as possible.
StartTime
   Hour, minute, second in the format hh:mm:ss. The job will not start before this time. Optional. The default is to run the job as soon as possible. If you specify StartTime but not StartDate, the default StartDate is the current day. If you specify StartDate but not StartTime, the default StartTime is 00:00:00. This means that the job will start as soon as possible on the specified date.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 239
Table 55. GUI fields and input (continued)

Priority
   Number between 0 and 100, inclusive. Optional. The default is 50. This is the user priority. For more information on this priority, refer to "Setting and changing the priority of a job" on page 230.
Image size
   Number in kilobytes that reflects the maximum size you expect your program to grow to as it runs. Optional.
Class
   Class name. The job will only run on machines that support the specified class name. Your system administrator defines the class names. Optional:
   v Press the Choices button to get a list of available classes.
   v Press the Details button under the class list to obtain long listing information about classes.
Hold
   Hold status of the submitted job. Permitted values are:
   user      User hold
   system    System hold (only valid for LoadLeveler administrators)
   usersys   User and system hold (only valid for LoadLeveler administrators)
   Note: The default is a no-hold state.
Account Number
   Number associated with the job. For use with the llacctmrg and llsummary commands for acquiring job accounting data. Optional. Required only if the ACCT keyword is set to A_VALIDATE in the configuration file.
Environment
   Your initial environment variables when your job starts. Separate environment specifications with semicolons. Optional.
Copy Environment
   All or Master, to indicate whether the environment variables specified in the keyword Environment are copied to all nodes or just to the master node of a parallel job. Optional.
Shell
   The name of the shell to use for the job. Optional. If not specified, the shell used in the owner's password file entry is used. If none is specified, /bin/sh is used.
Group
   The LoadLeveler group name to which the job belongs. Optional.
Step Name
   The name of this job step. Optional.

240 TWS LoadLeveler: Using and Administering
Table 55. GUI fields and input (continued)

Node Usage
   How the node is used. Permitted values are:
   shared            The node can be shared with other tasks of other job steps. This is the default.
   not shared        The node cannot be shared.
   slice not shared  Has the same meaning as not shared. It is provided for compatibility.
Dependency
   A Boolean expression defining the relationship between the job steps. Optional.
Large Page
   Whether or not the job step requires Large Page memory.
   yes        Use Large Page memory if available, otherwise use regular memory.
   mandatory  Use of Large Page memory is mandatory.
   no         Do not use Large Page memory.
Bulk Transfer
   Indicates to the communication subsystem whether it should use the bulk transfer mechanism to communicate between tasks.
   yes  Use bulk transfer.
   no   Do not use bulk transfer.
   Optional.
Rset
   What type of RSet support is requested. Permitted values are:
   rset_mcm_affinity  Requests scheduling affinity. Use the MCM options button to specify task allocation method, memory affinity preference or requirement, and adapter affinity preference or requirement.
   rset_name          Requests a user-defined RSet and nodes with rset_support set to rset_user_defined.
   Optional.
Comments
   Comments associated with the job. These comments help to distinguish one job from another job. Optional.
SMT
   Indicates whether a job requires the dynamic simultaneous multithreading (SMT) function.
   yes    The job requires the SMT function.
   no     The job does not require the SMT function.
   as_is  The SMT state will not be changed.

Note: The fields that appear in this table are what you see when viewing the Build a Job window. The text in these fields does not necessarily correspond with the keywords listed in "Job command file keyword descriptions" on page 359. See "Job command file keyword descriptions" on page 359 for information on the defaults associated with these keywords.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 241
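Many of these GUI fields map directly to job command file keywords. As a hedged sketch (the program, file, directory, and class names are hypothetical; see "Job command file keyword descriptions" on page 359 for the authoritative keyword names and defaults), a file equivalent to typical entries in this window might look like:

   # @ executable  = myprog
   # @ arguments   = -iterations 100
   # @ input       = myprog.in
   # @ output      = myprog.out
   # @ error       = myprog.err
   # @ initialdir  = /u/rich/work
   # @ class       = small
   # @ notify_user = rich
   # @ queue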
SELECT A Job Type if you want to change the job type. Your choices are:
   Serial     Specifies a serial job. This is the default.
   Parallel   Specifies a parallel job.
   Blue Gene  Specifies a Blue Gene job.
   MPICH      Specifies an MPICH job.
Note that the job type you select affects the choices that are active on the Build a Job window.

SELECT a Notification option. Your choices are:
   Always    Notify you when the job starts, completes, and if it incurs errors.
   Complete  Notify you when the job completes. This is the default option as initially defined in the skel.cmd file.
   Error     Notify you if the job cannot run because of an error.
   Never     Do not notify you.
   Start     Notify you when the job starts.

SELECT a Restart option. Your choices are:
   No   This job is not restartable. This is the default.
   Yes  Restart the job.

SELECT To restart the job on the same nodes from which it was vacated. Your choices are:
   No   Restart the job on any available nodes.
   Yes  Restart the job on the same nodes it ran on previously. This option is valid after a job has been vacated.
Note that there is no default for this selection.

SELECT a Checkpoint option. Your choices are:
   No        Do not checkpoint the job. This is the default.
   Yes       Checkpoint the job at intervals you determine. See the checkpoint keyword for more information.
   Interval  Checkpoint the job at intervals determined by LoadLeveler. See the checkpoint keyword for more information.

SELECT To start from a checkpoint file. Your choices are:

242 TWS LoadLeveler: Using and Administering
  • 263. No Do not start the job from a checkpoint file (start job from beginning). Yes Yes, restart the job from an existing checkpoint file when you submit the job. The file name must be specified by the job command file. The directory name may be specified by the job command file, configuration file, or default location. SELECT Coschedule if you want steps within a job to be scheduled and dispatched at the same time. Your choices are: No Disables coscheduling for your job step. Yes Allows coscheduling to occur for your job step. Note: 1. This keyword is not inherited by other job steps. 2. The default is No. 3. The coscheduling function is only available with the BACKFILL scheduler. SELECT Nodes (available when the job type is parallel) The Nodes dialog box appears. Complete the necessary fields to specify node information for a parallel job (see Table 56). Depending upon which model you choose, different fields will be available; any unavailable fields will be desensitized. LoadLeveler will assign defaults for any fields that you leave blank. For more information, see the appropriate job command file keyword (listed in parentheses) in “Job command file keyword descriptions” on page 359. Table 56. Nodes dialog box Field Available in: Input Min # of Nodes Tasks Per Node Minimum number of nodes required for running the Model and Tasks parallel job (node keyword). with Uniform Blocking Model Optional. The default is one. Max # of Nodes Tasks Per Node Maximum number of nodes required for running the Model parallel job (node keyword). Optional. The default is the minimum number of nodes. Tasks per Node Tasks Per Node The number of tasks of the parallel job you want to Model run per node (tasks_per_node keyword). Optional. Total Tasks Tasks with The total number of tasks of the parallel job you Uniform Blocking want to run on all available nodes (total_tasks Model, and keyword). Custom Blocking Model Optional for Uniform, required for Custom Blocking. The default is one. Blocking Custom Blocking The number of tasks assigned (as a block) to each Model consecutive node until all of a job’s tasks have been assigned (blocking keyword) Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 243
  • 264. Table 56. Nodes dialog box (continued) Field Available in: Input Task Geometry Custom The task ids of each task that you want to run on Geometry Model each node. You can use the ″Set Geometry″ button for step-by-step directions (task_geometry keyword). SELECT Close to return to the Build a Job dialog box. SELECT Network (available when the job type is parallel) The Network dialog box appears. The Network dialog box consists of two parts: The top half of the panel is for MPI, and the bottom half is for LAPI. Click on the check box to the left of MPI or LAPI to activate the part of the panel for which you want to specify network information. If you want to use MPI with LAPI, click on both: v The MPI check box. v The check box for Share windows between MPI and LAPI. Complete those fields for which you want to specify network information (see Table 57). For more information, see the network keyword description in “Job command file keyword descriptions” on page 359. Table 57. Network dialog box fields Field Input MPI (MPI/LAPI) Select: v Only the MPI check box to use the Message Passing Interface (MPI) protocol only. v Both the MPI check box and the Share windows between MPI and LAPI check box to use both MPI and the Low-level Application Programming Interface (LAPI) protocols. This selection corresponds to setting the network keyword in the job command file to MPI_LAPI. Optional. LAPI Select the LAPI check box to use Low-level Application Programming Interface (LAPI) protocol only. Optional. Adapter/Network Select an adapter name or a network type from the list. Required for each protocol you select. Adapter Usage Specifies that the adapter is either shared or not shared. Optional. The default is shared. Communication Mode Specifies the communication subsystem mode used by the communication protocol that you specify and can be either IP (Internet Protocol) or US (User Space). Optional. The default is IP. Communication Level Implies the amount of memory to be allocated to each window for User Space mode. Allocation can be Low, Average, or High. It is ignored by Switch_Network_Interface_For_HPS adapters. 244 TWS LoadLeveler: Using and Administering
  • 265. Table 57. Network dialog box fields (continued) Field Input Instances Specifies the number of windows or IP addresses the communication subsystem should allocate to this protocol. Optional. The default is 1 unless sn_all is specified for network and then the default is max. rCxt Blocks The number of user rCxt blocks requested for each window used by the associated protocol. It is recognized only by Switch_Network_Interface_For_HPS adapters. Optional. SELECT Close to return to the Build a Job dialog box. SELECT Requirements The Requirements dialog box appears. Complete those fields for which you want to specify requirements (see Table 58). Defaults are used for those fields that you leave blank. LoadLeveler dispatches your job only to one of those machines with resources that matches the requirements you specify. Table 58. Build a job dialog box fields Field Input Architecture Machine type. The job will not run on any other machine type. (see note 2) Optional. The default is the architecture of your current machine. Operating System Operating system. The job will not run on any other operating system. (see note 2) Optional. The default is the operating system of your current machine. Disk Amount of disk space in the execute directory. The job will only run on a machine with at least this much disk space. Optional. The default is defined in your local configuration file. Memory Amount of memory. The job will only run on a machine with at least this much memory. Optional. The default is defined in your local configuration file. Large Page Amount of Large Page memory, in megabytes. The job step requires at Memory least this much Large Page memory to run. Optional. Total Memory Amount of total (regular and Large Page memory) in megabytes needed to run the job step. Optional. Machines Machine names. The job will only run on the specified machines. Optional. Features Features. The job will only run on machines with specified features. Optional. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 245
  • 266. Table 58. Build a job dialog box fields (continued) Field Input Pool Specifies the number associated with the pool you want to use. All available pools listed in the administration file appear as choices. The default is to select nodes from any pool. LoadLeveler Specifies the version of LoadLeveler, in dotted decimal format, on the Version machine where you want the job to run. For example: 3.3.0.0 specifies that your job will run on a machine running LoadLeveler Version 3.3.0.0 or higher. Optional. Connectivity A number from 0.0 through 1.0, representing the average connectedness of the node’s managed adapters. Requirement Requirements. The job will only run if these requirements are met. Note: 1. If you enter a resource that is not available, you will NOT receive a message. LoadLeveler holds your job in the Idle state until the resource becomes available. Therefore, make certain that the spelling of your entry is correct. You can issue llq -s jobID to find out if you have a job for which requirements were not met. 2. If you do not specify an architecture or operating system, LoadLeveler assumes that your job can run only on your machine’s architecture and operating system. If your job is not a shell script that can be run successfully on any platform, you should specify a required architecture and operating system. SELECT Close to return to the Build a Job dialog box. SELECT Resources The Resources dialog box appears. This dialog box allows you to set the amount of defined consumable resources required for a job step. Resources with an ″*″ appended to their names are not in the SCHEDULE_BY_RESOURCES list. For more information, see the resources keyword. SELECT Close to return to the Build a Job dialog box. SELECT Preferences The Preferences dialog box appears. This dialog box is similar to the Requirements dialog box, with the exception of the Adapter choice, which is not supported as a Preference. Complete the fields for those parameters that you want to specify. These parameters are not binding. For any preferences that you specify, LoadLeveler attempts to find a machine that matches these preferences along with your requirements. If it cannot find the machine, LoadLeveler chooses the first machine that matches the requirements. SELECT Close to return to the Build a Job dialog box. SELECT Limits 246 TWS LoadLeveler: Using and Administering
The Limits dialog box appears. Complete the fields for those limits that you want to impose upon your job (see Table 59). If you type copy in any field except wall_clock_limit or job_cpu_limit, the limits in effect on the submit machine are used. If you leave any field blank, the default limits in effect for your user ID on the machine that runs the job are used. For more information, see "Using limit keywords" on page 89.

Table 59. Limits dialog box fields

CPU Limit
   Maximum amount of CPU time that the submitted job can use. Express the amount as:
   [[hours:]minutes:]seconds[.fraction]
   For example, 12:56:21 is 12 hours, 56 minutes, and 21 seconds. Optional.
Data Limit
   Maximum amount of the data segment that the submitted job can use. Express the amount as:
   integer[.fraction][units]
   Optional.
Core Limit
   Maximum size of a core file. Optional.
RSS Limit
   Maximum size of the resident set size. It is the largest amount of physical memory a user's process can allocate. Optional.
File Limit
   Maximum size of a file that is created. Optional.
Stack Limit
   Maximum size of the stack. Optional.
Job CPU Limit
   Maximum total CPU time to be used by all processes of a serial job step. For a parallel job, this is the total CPU time for each LoadL_starter process and its descendants, for each job step. Optional.
Wall Clock Limit
   Maximum amount of elapsed time for which a job can run. Optional.

SELECT Close to return to the Build a Job dialog box.

SELECT Checkpointing to specify checkpoint options (available when the checkpoint option is set to Yes or Interval)
   The checkpointing dialog box appears. Complete those fields for which you want to specify checkpoint information (see Table 60 on page 248). For detailed information on specific keywords, see "Job command file keyword descriptions" on page 359.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 247
Table 60. Checkpointing dialog box fields

Ckpt File
   Specifies a checkpoint file. The serial default is:
   $(job_name).$(host).$(domain).$(jobid).$(stepid).ckpt
Ckpt Directory
   Specifies a checkpoint directory name.
Ckpt Execute Directory
   Specifies a directory to use for staging the checkpoint executable file.
Ckpt Time Limits
   Sets the limits for the elapsed time a job can take checkpointing.

SELECT Close to return to the Build a Job dialog box.

SELECT Blue Gene (available when the job type is bluegene)
   The Blue Gene window appears. Complete the necessary fields to specify information for a Blue Gene job (see Table 61). Depending upon which request type you choose, different fields will be available; any unavailable fields will be desensitized. For more information, see the appropriate job command file keyword (listed in parentheses) in "Job command file keyword descriptions" on page 359.

Table 61. Blue Gene job fields

# of Compute Nodes (available when requesting by Size)
   The requested size in number of compute nodes that describes the size of the partition for this Blue Gene job. (bg_size)
Shape (available when requesting by Shape)
   The requested shape of the requested Blue Gene job. The units of each dimension of the shape are in number of base partitions, XxYxZ, where X, Y, and Z are the number of base partitions in the X-direction, Y-direction, and Z-direction. (bg_shape)
Partition Name (available when requesting by Partition)
   The name of an existing partition in the Blue Gene system where the requested job should run. (bg_partition)
Connection Type (available when requesting by Size and Shape)
   The kinds of Blue Gene partitions that can be selected for this job. You can select Torus, Mesh, or Prefer Torus. (bg_connection) Optional. The default is Mesh.
Rotate (available when requesting by Shape Dimensions)
   Whether to consider all possible rotations of the specified shape (True) or only the specified shape (False) when assigning a partition for the Blue Gene job. (bg_rotate) Optional. The default is True.

248 TWS LoadLeveler: Using and Administering
  • 269. Table 61. Blue Gene job fields (continued) Field Available when Input requesting by: Memory Megabytes A number (in megabytes) that represents the minimum available virtual memory that is needed to run the job. LoadLeveler generates a Blue Gene requirement that specifies memory that is greater than or equal to the amount you specify. Optional. If you leave this field blank, this parameter is not used when searching for machines to run your job. Requirements Expression An expression that specifies the Blue Gene requirements that a machine must meet in order to run the job. Memory is the supported keyword. SELECT Close to return to the Build a Job dialog box. Editing the job command file Use these instructions to edit the job command file that you just built. There are several ways that you can edit the job command file that you just built: 1. Using the Jobs window: SELECT File → Submit a Job The Submit a Job dialog box appears. SELECT The job file you want to edit from the file column. SELECT Edit Your job command file appears in a window. You can use any editor to edit the job command file. The default editor is specified in your .Xdefaults file. If you have an icon manager, an icon may appear. An icon manager is a program that creates a graphic symbol, displayed on a screen, that you can point to with a device such as a mouse in order to select a particular function or application. Select this icon to view your job command file. 2. Using the Tools Edit pull-down menus on the Build a Job window: Using the Edit pull-down menu, you can modify the job command file. Your choices appear in the Table 62: Table 62. Modifying the job command file with the Edit pull-down menu To Select Add a step to the job command file Add a Step or Add a First Step Delete a step from the job command file Delete a Step Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 249
  • 270. Table 62. Modifying the job command file with the Edit pull-down menu (continued) To Select Clear the fields in the Build a Job window Clear Fields Select defaults to use in the fields Set Field Defaults Note: Other options include Go to Next Step, Go to Previous Step, and Go to Last Step that allow you to edit various steps in the job command file. Using the Tools pull-down menu, you can modify the job command file. Your choices appear in Table 63: Table 63. Modifying the job command file with the Tools pull-down menu To Select Name the job Set Job Name Specify a cluster, cluster list, or any cluster, if a multicluster Set Cluster environment is configured. Open a window where you can enter a script file Append Script Fill in the fields using another file Restore from File View the job command file in a window View Entire Job Determine which step you are viewing What is step # Start a new job command file Start a new job You can save and submit the information you entered by selecting the choices shown in Table 64: Table 64. Saving and submitting information To Do This Save the information you SELECT entered into a file which you Save can submit later A window appears prompting you to enter a job filename. ENTER a job filename in the text entry field. SELECT OK The window closes and the information you entered is saved in the file you specified. Submit the program SELECT immediately and discard the Submit information you entered Submitting a job command file After building a job command file, you can submit it to one or more machines for processing. To submit a job, from the Jobs window: SELECT File → Submit a Job 250 TWS LoadLeveler: Using and Administering
The Submit a Job dialog box appears.

SELECT The job file that you want to submit from the file column.
   You can also use the filter field and the directories column to select the file, or you can type the file name in the text entry field.
SELECT Submit
   The job is submitted for processing. You can now submit another job, or you can press Close to exit the window.

Displaying and refreshing job status

When you submit a job, the status of the job is automatically displayed in the Jobs window.

You can update or refresh this status using the Jobs window and selecting one of the following:
v Refresh → Refresh Jobs
v Refresh → Refresh All

To change how often the Jobs window is automatically refreshed, use the Jobs window:

SELECT Refresh → Set Auto Refresh
   A window appears.
TYPE IN a value for the number of seconds to pass before the Jobs window is updated.
   Automatic refresh can be expensive in terms of network usage and CPU cycles. You should specify a refresh interval of 120 seconds or more for normal use.
SELECT OK
   The window closes and the value you specified takes effect.

To receive detailed information on a job:

SELECT Actions → Extended Status to receive additional information on the job.
   Selecting this option is the same as typing the llq -x command.
You can also get information in the following way:
SELECT Actions → Extended Details
   Selecting this option is the same as typing the llq -x -l command. You can also double-click on the job in the Jobs window to get details on the job.
   Note: Obtaining extended status or details on multiple jobs can be expensive in terms of network usage and CPU cycles.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 251
  • 272. SELECT Actions → Job Status You can also use the llq -s command to determine why a submitted job remains in the Idle or Deferred state. SELECT Actions → Resource Use Allows you to display resource use for running jobs. Selecting this option is the same as entering the llq -w command. SELECT Actions → Blue Gene Job Status Allows you to display Blue Gene job information for jobs. Selecting this option is the same as entering the llq -b command. For more information on requests for job information, see “llq - Query job status” on page 479. Sorting the Jobs window You can specify up to two sorting options for the Jobs window. The options you specify determine the order in which the jobs appear in the Jobs window. From the Jobs window: Select Sort → Set Sort Parameters A window appears Select A primary and secondary sort Table 65 lists the sorting options: Table 65. Sorting the jobs window To: Select Sort Sort jobs by the machine from which they were Sort by Submitting Machine submitted Sort by owner Sort by Owner Sort by the time the jobs were submitted Sort by Submission Time Sort by the state of the job Sort by State Sort jobs by their user priority (last job listed runs first) Sort by Priority Sort by the class of the job Sort by Class Sort by the group associated with the job Sort by Group Sort by the machine running the job Sort by Running Machine Sort by dispatch order Sort by Dispatch Order Not specify a sort No Sort You can select a sort type as either a Primary or Secondary sorting option. For example, suppose you select Sort by Owner as the primary sorting option and Sort by Class as the secondary sorting option. The Jobs window is sorted by owner and, within each owner, by class. 252 TWS LoadLeveler: Using and Administering
  • 273. Changing the priority of your jobs If your job has not yet begun to run and is still in the queue, you can change the priority of the job in relation to your other jobs in the queue that belong to the same class. This only affects the user priority of the job. For more information on this priority, refer to “Setting and changing the priority of a job” on page 230. Only the owner of a job or the LoadLeveler administrator can change the priority of a job. From the Jobs window: SELECT a job by clicking on it with the mouse SELECT Actions → Priority A window appears. TYPE IN a number between 0 and 100, inclusive, to indicate a new priority. SELECT OK The window closes and the priority of your job changes. Placing a job on hold Only the owner of a job or the LoadLeveler administrator can place a hold on a job. From the Jobs window: SELECT The job you want to hold by clicking on it with the mouse SELECT Actions → Hold The job is put on hold and its status changes in the Jobs window. Releasing the hold on a job Only the owner of a job or the LoadLeveler administrator can release a hold on a job. From the Jobs window: SELECT The job you want to release by clicking on it with the mouse SELECT Actions → Release from Hold The job is released from hold and its status is updated in the Jobs window. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 253
  • 274. Canceling a job Only the owner of a job or the LoadLeveler administrator can cancel a job. From the Jobs window: SELECT The job you want to cancel by clicking on it with the mouse SELECT Actions → Cancel LoadLeveler cancels the job and the job information disappears from the Jobs window. Modifying consumable resources and other job attributes Use these commands to modify the consumable CPUs or memory requirements of a nonrunning job. SELECT Modify → Consumable CPUs or Modify → Consumable Memory or Modify → Class or Modify → Account number or Modify → Blue Gene → Connection or Modify → Blue Gene → Partition or Modify → Blue Gene → Rotate or Modify → Blue Gene → Shape or Modify → Blue Gene → Size or Modify → Blue Gene → Requirement A dialog box appears prompting you to enter a new value for the selected job attribute. Blue Gene attributes are available when Blue Gene is enabled. TYPE IN The new value SELECT OK The dialog box closes and the value you specified takes effect. Taking a checkpoint Use these commands to checkpoint the selected job. 254 TWS LoadLeveler: Using and Administering
  • 275. SELECT One of the following actions to take when checkpoint has completed: v Continue the step v Terminate the step v Hold the step A checkpoint monitor for this step appears. Adding a job to a reservation Use these commands to bind selected job steps to a reservation so that they will only be scheduled to run on the nodes reserved for the reservation. SELECT The job you want to bind by clicking on it with the mouse. SELECT Actions → Bind to Reservation A window appears. SELECT A reservation from the list. SELECT OK The window closes and the job is bound to that reservation. Removing a job from a reservation Use these commands to unbind selected job steps from reservations to which they currently belong. SELECT The job you want to unbind by clicking on it with the mouse. SELECT Actions → Unbind from Reservation If the job is bound to a reservation, it is removed from the reservation. Displaying and refreshing machine status The status of the machines is automatically displayed in the Machines window. You can update or refresh this status using the Machines window and selecting one of the following: v Refresh → Refresh Machines v Refresh → Refresh All. To specify an amount of time to pass before the Machines window is automatically refreshed, from the Machines window: SELECT Refresh → Set Auto Refresh A window appears. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 255
  • 276. TYPE IN a value for the number of seconds to pass before the Machines window is updated. Automatic refresh can be expensive in terms of network usage and CPU cycles. You should specify a refresh interval of 120 seconds or more for normal use. SELECT OK The window closes and the value you specified takes effect. To receive detailed information on a machine: SELECT Actions → Details This displays status information about the selected machines. Selecting this option has the same effect as typing the llstatus -l command SELECT Actions → Adapter Details This displays virtual and physical adapter information for each selected machine. Selecting this option has the same effect as typing the llstatus -a command SELECT Actions → Floating Resources This displays consumable resources for the LoadLeveler cluster. Selecting this option has the same effect as typing the llstatus -R command SELECT Actions → Machine Resources This displays consumable resources defined for the selected machines or all machines. Selecting this option has the same effect as typing the llstatus -R command SELECT Actions → Cluster Status This displays status of machines in the defined cluster or clusters. It appears only when a multicluster environment is configured and is equivalent to the llstatus -X all command. SELECT Actions → Cluster Config This displays cluster information from the LoadL_admin file. Only fields with data specified or which have defaults when not specified are displayed. It appears only when a multicluster environment is configured and is equivalent to the llstatus -C command. SELECT Actions → Blue Gene ... This displays information about the Blue Gene system. You can select the option for Status for a short listing, Details for a long listing, Base Partitions for Blue Gene base partition status, or Partitions for existing 256 TWS LoadLeveler: Using and Administering
  • 277. Blue Gene partition status. It is available only when Blue Gene support is enabled in LoadLeveler. This is equivalent to the llstatus command with the options -b, -b -l, -B, or -P. Sorting the Machines window You can specify up to two sorting options for the Machines window. The options you specify determine the order in which machines appear in the window. From the Machines window: Select Sort → Set Sort Parameters A window appears Select A primary and secondary sort Table 66 lists sorting options for the Machines window: Table 66. Sorting the machines window To: Select Sort → Sort by machine name Sort by Name Sort by Schedd state Sort by Schedd Sort by total number of jobs scheduled Sort by InQ Sort by number of running jobs scheduled by this machine Sort by Act Sort by startd state Sort by Startd Sort by the number of jobs running on this machine Sort by Run Sort by load average Sort by LdAvg Sort by keyboard idle time Sort by Idle Sort by hardware architecture Sort by Arch Sort by operating system type Sort by OpSys Not specify a sort No Sort You can select a sort type as either a Primary or Secondary sorting option. For example, suppose you select Sort by Arch as the primary sorting option and Sort by Name as the secondary sorting option. The Machines window is sorted by hardware architecture, and within each architecture type, by machine name. Finding the location of the central manager The LoadLeveler administrator designates one of the nodes in the LoadLeveler cluster as the central manager. When jobs are submitted at any node, the central manager is notified and decides where to schedule the jobs. In addition, it keeps track of the status of machines in the cluster and the jobs in the system by communicating with each node. LoadLeveler uses this information to make the scheduling decisions and to respond to queries. To find the location of the central manager, from the Machines window: Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 257
  • 278. SELECT Actions → Find Central Manager A message appears in the message window declaring on which machine the central manager is located. Finding the location of the public scheduling machines Public scheduling machines are those machines that participate in the scheduling of LoadLeveler jobs on behalf of the submit-only machines. To get a list of these machines in your cluster, use the Machines window: SELECT Actions → Find Public Scheduler A message appears displaying the names of these machines. Finding the type of scheduler in use The LoadLeveler administrator defines the scheduler used by the cluster. To determine which scheduler is currently in use: SELECT Actions → Find Scheduler Type A message appears displaying the type: v ll_default v BACKFILL v External (API) Specifying which jobs appear in the Jobs window Normally, only your jobs appear in the Jobs window. You can, however, specify which jobs you want to appear by using the Select pull-down menu on the Jobs window (see Table 67). Table 67. Specifying which jobs appear in the Jobs window To Display Select Select → All jobs in the queue All All jobs belonging to a specific By User user (or users) A window appears prompting you to enter the user IDs whose jobs you want to view. All jobs submitted to a specific By Machine machine (or machines) A window appears prompting you to enter the machine names on which the jobs you want to view are running. All jobs belonging to a specific By Group group (or groups) A window appears prompting you to enter the LoadLeveler group names to which the jobs you want to view belong. 258 TWS LoadLeveler: Using and Administering
Table 67. Specifying which jobs appear in the Jobs window (continued)

All jobs having a particular ID
   Select Select → By Job Id. A dialog box prompts you to enter the ID of the job you want to appear. This ID appears in the left column of the Jobs window. Type in the ID and press OK.

Note: When you choose By User, By Machines, or By Group, you can use a UNIX regular expression enclosed in parentheses. For example, you can enter (^k10) to display all machines beginning with the characters "k10".

SELECT Select → Show Selection to show the selection parameters.

Specifying which machines appear in Machines window

You can specify which machines will appear in the Machines window. See Table 68. The default is to view all of the machines in the LoadLeveler pool.

From the Machines window:

Table 68. Specifying which machines appear in Machines window

View all of the machines
   Select Select → All.
View machines by operating system
   Select Select → by OpSys. A window appears prompting you to enter the operating system of those machines you want to view.
View machines by hardware architecture
   Select Select → by Arch. A window appears prompting you to enter the hardware architecture of those machines you want to view.
View machines by state
   Select Select → by State. A cascading pull-down menu appears prompting you to select the state of the machines that you want to view.

SELECT Select → Show Selection to show the selection parameters.

Saving LoadLeveler messages in a file

Normally, all the messages that LoadLeveler generates appear in the Messages window. If you would also like to have these messages written to a file, use the Messages window.

SELECT Actions → Start logging to a file
   A window appears prompting you to enter a filename in which to log the messages.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 259
  • 280. TYPE IN The filename in the text entry field. SELECT OK The window closes. 260 TWS LoadLeveler: Using and Administering
  • 281. Part 4. TWS LoadLeveler interfaces reference The topics in the TWS LoadLeveler interfaces reference provide the details you need to know to correctly use the IBM Tivoli Workload Scheduler (TWS) LoadLeveler interfaces for the following tasks: v Specifying keywords in the TWS LoadLeveler control files v Starting and customizing the TWS LoadLeveler GUI v Correctly coding the TWS LoadLeveler commands and APIs 261
  • 282. 262 TWS LoadLeveler: Using and Administering
Chapter 12. Configuration file reference

The configuration file contains many parameters that you can set or modify to control how LoadLeveler operates. You may control LoadLeveler's operation either:
v Across the cluster, by modifying the global configuration file, LoadL_config, or
v Locally, by modifying the LoadL_config.local file on individual machines.

Table 69 shows the configuration subtasks:

Table 69. Configuration subtasks

To find out what administrator tasks you can accomplish by using the configuration file, see:
v Chapter 4, "Configuring the LoadLeveler environment," on page 41

To learn how to correctly specify the contents of a configuration file, see:
v "Configuration file syntax"
v "Configuration file keyword descriptions" on page 265
v "User-defined keywords" on page 313
v "LoadLeveler variables" on page 314

Configuration file syntax

The information in both the LoadL_config and the LoadL_config.local files is in the form of a statement. These statements are made up of keywords and values.

There are three types of configuration file keywords:
v Keywords, described in "Configuration file keyword descriptions" on page 265.
v User-defined variables, described in "User-defined keywords" on page 313.
v LoadLeveler variables, described in "LoadLeveler variables" on page 314.

Configuration file statements take one of the following formats:
   keyword=value
   keyword:value

Statements in the form keyword=value are used primarily to customize an environment. Statements in the form keyword:value are used by LoadLeveler to characterize the machine and are known as part of the machine description. Every machine in LoadLeveler has its own machine description, which is read by the central manager when LoadLeveler is started.

Keywords are not case sensitive. This means you can enter them in lowercase, uppercase, or mixed case.

Note: For the keyword=value form, if the keyword is of a boolean type and only true and false are valid input, a value string starting with t or T is taken as true; all other values are taken as false.

To continue configuration file statements, use the backslash character (\).

263
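A short hedged fragment illustrating both statement forms and line continuation (the user-defined variable MY_TMP and the threshold values are illustrative only):

   MY_TMP = /var/loadl/tmp
   LOG    = $(MY_TMP)/log
   START : (LoadAvg < 0.5) && \
           (KeyboardIdle > 300)

Here MY_TMP and LOG are keyword=value statements that customize the environment, while the START expression is a keyword:value statement forming part of the machine description; the backslash continues the statement onto a second line.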
In the configuration file, comments must be on a separate line from keyword statements.

You can use the following types of constants and operators in the configuration file.

Numerical and alphabetical constants
   Constants may be represented as:
   v Boolean expressions
   v Signed integers
   v Floating point values
   v Strings enclosed in double quotes (" ").

Mathematical operators
   You can use the following C operators. The operators are listed in order of precedence. All of these operators are evaluated from left to right:
   v !
   v * /
   v - +
   v < <= > >=
   v == !=
   v &&
   v ||

64-bit support for configuration file keywords and expressions

Administrators can assign 64-bit integer values to selected keywords in the configuration file.

floating_resources
   Consumable resources associated with the floating_resources keyword may be assigned 64-bit integer values. Fractional and unit specifications are not allowed. The predefined ConsumableCpus, ConsumableMemory, ConsumableLargePageMemory, and ConsumableVirtualMemory may not be specified as floating resources.
   Example:
   floating_resources = spice2g6(9876543210123) db2_license(1234567890)

MACHPRIO expression
   The LoadLeveler variables Disk, ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, ConsumableLargePageMemory, PagesScanned, Memory, VirtualMemory, FreeRealMemory, and PagesFreed may be used in a MACHPRIO expression. They are 64-bit integers, and 64-bit arithmetic is used to evaluate them.
   Example:
   MACHPRIO: (Memory + FreeRealMemory) - (LoadAvg*1000 + PagesScanned)

264 TWS LoadLeveler: Using and Administering
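As a further sketch of how these operators combine (the weights are arbitrary and the expression is illustrative, not a recommended policy): if the expression evaluator follows C semantics, as the operator list above suggests, a comparison evaluates to 0 or 1 and can be used arithmetically, for example in a SYSPRIO expression:

   SYSPRIO : (GroupSysprio > 50) * 100000 + (ClassSysprio * 100) - (QDate)

Jobs whose group system priority exceeds 50 would receive a large fixed boost, while earlier submission times (smaller QDate values) still break ties.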
Configuration file keyword descriptions

This topic provides an alphabetical list of the keywords you can use in a LoadLeveler configuration file. It also provides examples of statements that use these keywords.

ACCT
   Turns the accounting function on or off.
   Syntax: ACCT = flag ...
   The available flags are:
   A_DETAIL
      Enables extended accounting. Using this flag causes LoadLeveler to record detail resource consumption by machine and by events for each job step. This flag also enables the -x flag of the llq command, permitting users to view resource consumption for active jobs.
   A_RES
      Turns reservation data recording on.
   A_OFF
      Turns accounting data recording off.
   A_ON
      Turns accounting data recording on. If specified without the A_DETAIL flag, the following is recorded:
      v The total amount of CPU time consumed by the entire job
      v The maximum memory consumption of all tasks (or nodes)
   Default value: A_OFF
   Example: This example specifies that accounting should be turned on, that extended accounting data should be collected, and that the -x flag of the llq command be enabled:
   ACCT = A_ON A_DETAIL

ACCT_VALIDATION
   Identifies the executable called to perform account validation.
   Syntax: ACCT_VALIDATION = program
   Where program is a validation program.
   Default value: $(BIN)/llacctval (the accounting validation program shipped with LoadLeveler).

ACTION_ON_MAX_REJECT
   Specifies the state in which jobs are placed when their rejection count has reached the value of the MAX_JOB_REJECT keyword. HOLD specifies that jobs are placed in User Hold status; SYSHOLD specifies that jobs are placed in System Hold status; CANCEL specifies that jobs are canceled. When a job is rejected, LoadLeveler sends a mail message stating why the job was rejected.
   Syntax: ACTION_ON_MAX_REJECT = HOLD | SYSHOLD | CANCEL

Chapter 12. Configuration file reference 265
  • 286. Default value: HOLD ACTION_ON_SWITCH_TABLE_ERROR Points to an administrator supplied program that will be run when DRAIN_ON_SWITCH_TABLE_ERROR is set to true and a switch table unload error occurs. Syntax: ACTION_ON_SWITCH_TABLE_ERROR = program Default value: The default is to not run a program. ADMIN_FILE Points to the administration file containing user, class, group, machine, and adapter stanzas. Syntax: ADMIN_FILE = directory Default value: $(tilde)/admin_file AFS_GETNEWTOKEN Specifies a filter that, for example, can be used to refresh an AFS token. Syntax: AFS_GETNEWTOKEN = full_path_to_executable Where full_path_to_executable is an administrator-supplied program that receives the AFS authentication information on standard input and writes the new information to standard output. The filter is run when the job is scheduled to run and can be used to refresh a token which expired when the job was queued. Default value: The default is to not run a program. AGGREGATE_ADAPTERS Allows an external scheduler to specify per-window adapter usages. Syntax: AGGREGATE_ADAPTERS = YES | NO When this keyword is set to YES, the resources from multiple switch adapters on the same switch network are treated as one aggregate pool available to each job. When this keyword is set to NO, the switch adapters are treated individually and a job cannot use resources from multiple adapters on the same network. Set this keyword to NO when you are using an external scheduler; otherwise, set to YES (or accept the default). Default value: YES | ALLOC_EXCLUSIVE_CPU_PER_JOB | Specifies the way CPU affinity is enforced on Linux platforms. When this | keyword is not specified or when an unrecognized value is assigned to it, | LoadLeveler will not attempt to set CPU affinity for any application processes | spawned by it. | Note: This keyword is valid only on Linux x86 and x86_64 platforms. This | keyword is ignored by LoadLeveler on all other platforms. | The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the | global or local configuration files. It can also be specified in both configuration 266 TWS LoadLeveler: Using and Administering
ALLOC_EXCLUSIVE_CPU_PER_JOB
Specifies the way CPU affinity is enforced on Linux platforms. When this keyword is not specified, or when an unrecognized value is assigned to it, LoadLeveler will not attempt to set CPU affinity for any application processes spawned by it.
Note: This keyword is valid only on Linux x86 and x86_64 platforms. This keyword is ignored by LoadLeveler on all other platforms.
The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the global or local configuration files. It can also be specified in both configuration files, in which case the setting in the local configuration file will override that of the global configuration file. The keyword cannot be turned off in a local configuration file if it has been set to any value in the global configuration file.
Changes to ALLOC_EXCLUSIVE_CPU_PER_JOB will not take effect at reconfiguration. The administrator must stop and restart or recycle LoadLeveler when changing ALLOC_EXCLUSIVE_CPU_PER_JOB.
Syntax:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL | PHYSICAL
Default value: By default, when this keyword is not specified, CPU affinity is not set.
Example: When the value of this keyword is set to LOGICAL, only one LoadLeveler job step will run on each of the processors available on the machine:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL
Example: When the value of this keyword is set to PHYSICAL, all logical processors (or physical cores) configured in one physical CPU package will be allocated to one and only one LoadLeveler job step:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL

ARCH
Indicates the standard architecture of the system. The architecture you specify here must be specified in the same format in the requirements and preferences statements in job command files. The administrator defines the character string for each architecture.
Syntax:
   ARCH = string
Default value: Use the command llstatus -l to view the default.
Example: To define a machine as an RS/6000®, the keyword would look like:
   ARCH = R6000

BG_ALLOW_LL_JOBS_ONLY
Specifies whether only jobs submitted through LoadLeveler will be accepted by the Blue Gene job launcher program.
Syntax:
   BG_ALLOW_LL_JOBS_ONLY = true | false
Default value: false

BG_CACHE_PARTITIONS
Specifies whether allocated partitions are to be reused for Blue Gene jobs whenever possible.
Syntax:
   BG_CACHE_PARTITIONS = true | false
Default value: true

BG_ENABLED
Specifies whether Blue Gene support is enabled.
Syntax:
   BG_ENABLED = true | false
If the value of this keyword is true, the central manager will load the Blue Gene control system libraries and query the state of the Blue Gene system so that jobs of type bluegene can be scheduled.
Default value: false
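As a sketch, a site enabling Blue Gene support might set these three keywords together; whether to restrict the launcher to LoadLeveler jobs is a local policy choice, and the combination below is illustrative rather than required:
   BG_ENABLED = true
   BG_ALLOW_LL_JOBS_ONLY = true
   BG_CACHE_PARTITIONS = true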
BG_MIN_PARTITION_SIZE
Specifies the smallest number of compute nodes in a partition.
Syntax:
   BG_MIN_PARTITION_SIZE = 32 | 128 | 512 (for Blue Gene/L)
   BG_MIN_PARTITION_SIZE = 16 | 32 | 64 | 128 | 256 | 512 (for Blue Gene/P)
The value for this keyword must not be smaller than the minimum partition size supported by the physical Blue Gene hardware. If the number of compute nodes requested in a job is less than the minimum partition size, LoadLeveler will increase the requested size to the minimum partition size. If the max_psets_per_bp value is set in the DB_PROPERTY file, the value for BG_MIN_PARTITION_SIZE must be set as described in Table 70:

Table 70. BG_MIN_PARTITION_SIZE values

max_psets_per_bp value    BG_MIN_PARTITION_SIZE    BG_MIN_PARTITION_SIZE
in DB_PROPERTY file       for Blue Gene/L          for Blue Gene/P
4                         >= 128                   >= 128
8                         >= 128                   >= 64
16                        >= 32                    >= 32
32                        >= 32                    >= 16

Default value: 32

BIN
Defines the directory where LoadLeveler binaries are kept.
Syntax:
   BIN = $(RELEASEDIR)/bin
Default value: $(tilde)/bin

CENTRAL_MANAGER_HEARTBEAT_INTERVAL
Specifies how frequently, in seconds, the primary and alternate central managers communicate with each other.
Syntax:
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = number
Default value: The default is 300 seconds (5 minutes).

CENTRAL_MANAGER_TIMEOUT
Specifies the number of heartbeat intervals that an alternate central manager will wait before declaring that the primary central manager is not operating.
Syntax:
   CENTRAL_MANAGER_TIMEOUT = number
Default value: The default is 6.
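Example: With the illustrative settings below, an alternate central manager would declare the primary down after 6 missed heartbeat intervals of 60 seconds each, that is, after about 360 seconds:
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 60
   CENTRAL_MANAGER_TIMEOUT = 6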
CKPT_CLEANUP_INTERVAL
Specifies the interval, in seconds, at which the Schedd daemon will run the program specified by the CKPT_CLEANUP_PROGRAM keyword.
Syntax:
   CKPT_CLEANUP_INTERVAL = number
number must be a positive integer.
Default value: -1

CKPT_CLEANUP_PROGRAM
Identifies an administrator-provided program which is to be run at the interval specified by the CKPT_CLEANUP_INTERVAL keyword. The intent of this program is to delete old checkpoint files created by jobs running under LoadLeveler during the checkpoint process.
Syntax:
   CKPT_CLEANUP_PROGRAM = program
Where program is the fully qualified name of the program to be run. The program must be accessible and executable by LoadLeveler. A sample program to remove checkpoint files is provided in the /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c file.
Default value: No default value is set.
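For example, a site might run a cleanup program built from the shipped sample once an hour; the installation path below is hypothetical:
   CKPT_CLEANUP_PROGRAM = /u/loadl/bin/rmckptfiles
   CKPT_CLEANUP_INTERVAL = 3600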
CKPT_EXECUTE_DIR
Specifies the directory where the job step's executable will be saved for checkpointable jobs. You can specify this keyword in either the configuration file or the job command file; different file permissions are required depending on where this keyword is set. For additional information, see "Planning considerations for checkpointing jobs" on page 140.
Syntax:
   CKPT_EXECUTE_DIR = directory
This directory cannot be the same as the current location of the executable file, or LoadLeveler will not stage the executable. In this case, the user must have execute permission for the current executable file.
Default value: By default, the executable of a checkpointable job step is not staged.

CLASS
Determines whether a machine will accept jobs of a certain job class. For parallel jobs, you must define a class instance for each task you want to run on a node, using one of two formats:
v The format CLASS = class_name (count) names the classes and sets the number of tasks for each class in parentheses. With this format, the following rules apply:
  – Each class can have only one entry.
  – If a class has more than one entry or there is a syntax error, the entire CLASS statement will be ignored.
  – If the CLASS statement has a blank value or is not specified, it defaults to No_Class (1).
  – The number of instances for a class specified inside the parentheses must be an unsigned integer. If the number specified is 0, it is correct syntactically, but the class will not be defined in LoadLeveler.
  – If the number of instances for all classes in the CLASS statement is 0, the default No_Class (1) will be used.
v The format CLASS = { "class1" "class2" "class2" "class2" } names each class and sets the number of tasks for each class based on the number of times that the class name appears inside the {} operands.
Note: With both formats, the class names list is blank-delimited.
For a LoadLeveler job to run on a machine, the machine must have a vacancy for the class of that job. If the machine is configured for only one No_Class job and a LoadLeveler job is already running there, then no further LoadLeveler jobs are started on that machine until the current job completes.
You can have a maximum of 1024 characters in the class statement. You cannot use allclasses or data_stage as a class name, since these are reserved LoadLeveler keywords.
You can assign multiple classes to the same machine by specifying the classes in the LoadLeveler configuration file (called LoadL_config) or in the local configuration file (called LoadL_config.local). The classes themselves should be defined in the administration file. See "Setting up a single machine to have multiple job classes" on page 723 and "Defining classes" on page 89 for more information on classes.
Syntax:
   CLASS = { "class_name" ... } | {"No_Class"} | class_name (count) ...
Default value: {"No_Class"}
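Example: The two statements below are equivalent ways of allowing two tasks of class small and one task of class large on a machine; the class names are illustrative and would need to be defined in the administration file:
   CLASS = small(2) large(1)
   CLASS = { "small" "small" "large" }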
CLIENT_TIMEOUT
Specifies the maximum time, in seconds, that a daemon waits for a response over TCP/IP from a process. If the waiting time exceeds the specified amount, the daemon tries again to communicate with the process. In general, you should use the default setting unless you are experiencing delays due to an excessively loaded network. If so, you should try increasing this value.
Syntax:
   CLIENT_TIMEOUT = number
Default value: The default is 30 seconds.

CLUSTER_METRIC
Indicates the installation exit to be run by the Schedd to determine where a remote job is distributed. If a remote job is submitted with a list of clusters or the reserved word any and the installation exit is not specified, the remote job is not submitted.
Syntax:
   CLUSTER_METRIC = full_pathname_to_executable
The installation exit is run with the following parameters passed as input. All parameters are character strings.
v The job ID of the job to be distributed
v The number of clusters in the list of clusters
v A blank-delimited list of clusters to be considered
If the user specifies the reserved word any as the cluster_list during job submission, the job is sent to the first outbound Schedd defined for the first configured remote cluster. If the user specifies a list of clusters, the job is sent to the first outbound Schedd defined for the first specified remote cluster. In either case, the CLUSTER_METRIC exit is executed on this machine to determine where the job will be distributed. If this machine is not the outbound_hosts Schedd for the assigned cluster, the job will be forwarded to the correct outbound_hosts Schedd.
Note: The list of clusters may contain a single entry of the reserved word any, which indicates that the CLUSTER_METRIC installation exit must determine its own list of clusters to select from. This can be all of the clusters available using the data access API or a predetermined list set by the administrator. If any is specified in place of a cluster list, the metric will receive a count of 1 followed by the keyword any.
The installation exit must write the remote cluster name to which the job is submitted as standard output and exit with a value of 0. An exit value of -1 indicates an error in determining the cluster for distribution and the job is not submitted. Returned cluster names that are not valid also cause the job not to be submitted. STDERR from the exit is written to the Schedd log.
LoadLeveler provides a set of sample exits for use in distributing jobs by the following metrics:
v The number of jobs in the idle queue
v The number of jobs in the specified class
v The number of free nodes in the cluster
The installation exit samples are available in the ${RELEASEDIR}/samples/llcluster directory.
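For instance, to distribute remote jobs using an exit built from the idle-queue sample, the configuration might look like the following; the path is hypothetical and depends on where the sample was compiled and installed:
   CLUSTER_METRIC = /u/loadl/bin/idle_queue_metric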
CLUSTER_REMOTE_JOB_FILTER
Indicates the installation exit to be run by the inbound Schedd for each remote job request to filter the user's job command file statements during submission or move job. If the keyword is not specified, no job filtering is done.
Syntax:
   CLUSTER_REMOTE_JOB_FILTER = full_pathname_to_executable
The installation exit is run with the submitting user's ID. All parameters are character strings. This installation exit is executed on the inbound_hosts of the local cluster when receiving a job submission or move job request. The executable specified is called with the submitting user's unfiltered job command file statements as the standard input. The standard output is submitted to LoadLeveler. If the exit returns with a nonzero exit code, the remote job submission or job move will fail.
A submit filter can only make changes to LoadLeveler job command file statements. The data access API can be used by the remote job filter to query the Schedd for the job object received from the sending cluster. If the local submission filter on the submitting cluster has added or deleted steps from the original user's job command file, the remote job filter must add or delete the same number of steps. The job command file statements returned by the remote job filter must contain the same number of steps as the job object received from the sending cluster.
Changes to the following job command file keyword statements are ignored:
v executable
v environment
v image_size
v cluster_input_file
v cluster_output_file
v cluster_list
The following job command file keyword will have different behavior:
v initialdir – If not set by the remote job filter or the submitting user's unfiltered job command file, the default value will remain the current working directory at the time the job was submitted. Access to the initialdir will be verified on the cluster selected to run the job. If access to initialdir fails, the submission or move job will fail.
When you distribute a scale-across job to other clusters for scheduling and a remote job filter is configured, the filter will be applied to the distributed job. However, only changes to the following job command file keyword statements will be accepted; changes to any other statement by the remote job filter will be ignored:
v #@ class
v #@ priority
v #@ as_limit
v #@ core_limit
v #@ cpu_limit
v #@ data_limit
v #@ file_limit
v #@ job_cpu_limit
v #@ locks_limit
v #@ memlock_limit
v #@ nofile_limit
v #@ nproc_limit
v #@ rss_limit
v #@ stack_limit
To maintain compatibility between the SUBMIT_FILTER and CLUSTER_REMOTE_JOB_FILTER programs, the following environment variables are set when either exit is invoked:
v LOADL_ACTIVE – the LoadLeveler version.
v LOADL_STEP_COMMAND – the location of the job command file passed as input to the program. This job command file only contains LoadLeveler keywords.
v LOADL_STEP_ID – the job identifier, generated by the submitting LoadLeveler cluster.
  Note: The environment variable name is LOADL_STEP_ID although the value it contains is a "job" identifier. This name is used to be compatible with the local job filter interface.
v LOADL_STEP_OWNER – the owner (UNIX user name) of the job.
CLUSTER_USER_MAPPER
Indicates the installation exit to be run by the inbound Schedd for each remote job request to determine the user mapping of the cluster. This keyword implies that user mapping is performed. If the keyword is not specified, no user mapping is done.
Syntax:
   CLUSTER_USER_MAPPER = full_pathname_to_executable
The installation exit is run with the following parameters passed as input. All parameters are character strings.
v The user name to be mapped
v The name of the cluster where the user originated
This installation exit is executed on the inbound_hosts of the local cluster when receiving a job submission, move job request, or remote command.
The installation exit must write the new user name as standard output and exit with a value of 0. An exit value of -1 indicates an error and the job is not submitted. An exit value of 1 indicates that the user name returned for this job was not mapped. STDERR from the exit is written to the Schedd log.

CM_CHECK_USERID
Specifies whether the central manager will check the existence of user IDs that sent requests through a command or API on the central manager machine.
Syntax:
   CM_CHECK_USERID = true | false
Default value: true

COLLECTOR_DGRAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax:
   COLLECTOR_DGRAM_PORT = port number
Default value: The default is 9612.

COMM
Specifies a local directory where LoadLeveler keeps special files used for UNIX domain sockets for communicating among LoadLeveler daemons running on the same machine. This keyword allows the administrator to choose a file system other than /tmp for these files. If you change the COMM option, you must stop and then restart LoadLeveler using the llctl command.
Syntax:
   COMM = local directory
Default value: The default location for the files is /tmp.

CONTINUE
Determines whether suspended jobs should continue execution.
Syntax:
   CONTINUE: expression that evaluates to T or F (true or false)
When T, suspended LoadLeveler jobs resume execution on the machine.
Default value: No default value is set.
For information about time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.
CUSTOM_METRIC
Specifies a machine's relative priority to run jobs.
Syntax:
   CUSTOM_METRIC = number
This is an arbitrary number which you can use in the MACHPRIO expression. Negative values are not allowed.
Default value: If you specify neither CUSTOM_METRIC nor CUSTOM_METRIC_COMMAND, CUSTOM_METRIC = 1 is assumed. For more information, see "Setting negotiator characteristics and policies" on page 45.
For more information related to using this keyword, see "Defining a LoadLeveler cluster" on page 44.

CUSTOM_METRIC_COMMAND
Specifies an executable and any required arguments. The exit code of this command is assigned to CUSTOM_METRIC. If this command does not exit normally, CUSTOM_METRIC is assigned a value of 1. This command is forked every (POLLING_FREQUENCY * POLLS_PER_UPDATE) period.
Syntax:
   CUSTOM_METRIC_COMMAND = command
Default value: No default is set; LoadLeveler does not run any command to determine CUSTOM_METRIC.
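For example, a site could rank machines with a locally written program whose exit code reflects each machine's suitability, and then feed that value into the machine priority expression (MACHPRIO is described later in this topic); the program path is hypothetical:
   CUSTOM_METRIC_COMMAND = /u/loadl/bin/rank_machine
   MACHPRIO : CustomMetric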
DCE_AUTHENTICATION_PAIR
Specifies a pair of installation-supplied programs that are used to authenticate DCE security credentials.
Restriction: DCE security is not supported by LoadLeveler for Linux.
Syntax:
   DCE_AUTHENTICATION_PAIR = program1, program2
Where program1 and program2 are LoadLeveler- or installation-supplied programs that are used to authenticate DCE security credentials. program1 obtains a handle (an opaque credentials object), at the time the job is submitted, which is used to authenticate to DCE. program2 uses the handle obtained by program1 to authenticate to DCE before starting the job on the executing machines.
Default value: See "Handling DCE security credentials" on page 74 for information about defaults.

DEFAULT_PREEMPT_METHOD
Specifies the default preemption method for LoadLeveler to use when a preempt method is not specified in a PREEMPT_CLASS statement or in the llpreempt command. LoadLeveler also uses this default preemption method to preempt job steps that are running on reserved machines when a reservation period begins.
Restrictions:
v This keyword is valid only for the BACKFILL scheduler.
v The suspend method of preemption (the default) might not be supported on your level of Linux. If you want to preempt jobs that are running where process tracking is not supported, you must use this keyword to specify a method other than suspend.
Syntax:
   DEFAULT_PREEMPT_METHOD = rm | sh | su | vc | uh
Valid values are:
rm LoadLeveler preempts the jobs and removes them from the job queue. To rerun the job, the user must resubmit the job to LoadLeveler.
sh LoadLeveler ends the jobs and puts them into System Hold state. They remain in that state on the job queue until an administrator releases them. After being released, the jobs go into Idle state and will be rescheduled to run as soon as resources for the job are available.
su LoadLeveler suspends the jobs and puts them in Preempted state. They remain in that state on the job queue until the preempting job has terminated and resources are available to resume the preempted job on the same set of nodes. To use this value, process tracking must be enabled.
vc LoadLeveler ends the jobs and puts them in Vacate state. They remain in that state on the job queue and will be rescheduled to run as soon as resources for the job are available.
uh LoadLeveler ends the jobs and puts them into User Hold state. They remain in that state on the job queue until an administrator releases them. After being released, the jobs go into Idle state and will be rescheduled to run as soon as resources for the job are available.
Default value: su (suspend method)
For more information related to using this keyword, see "Steps for configuring a scheduler to preempt jobs" on page 130.
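On a Linux cluster where process tracking is not supported, for instance, an administrator might make vacate the default preemption method instead of suspend:
   DEFAULT_PREEMPT_METHOD = vc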
DRAIN_ON_SWITCH_TABLE_ERROR
Specifies whether the startd should be drained when the switch table fails to unload, flagging the administrator that intervention may be required to unload the switch table.
Syntax:
   DRAIN_ON_SWITCH_TABLE_ERROR = true | false
Default value: false

DSTG_MAX_STARTERS
Specifies a machine-specific limit on the number of data staging initiators. Since each task of a data staging job step consumes one initiator from the data_stage class on the specified machine, DSTG_MAX_STARTERS provides the maximum number of data staging tasks that can run at the same time on the machine.
Syntax:
   DSTG_MAX_STARTERS = number
Notes:
1. If you have not set the DSTG_MAX_STARTERS value in either the global or local configuration files, there will not be any data staging initiators on the specified machine. In this configuration, the compute node will not be allowed to perform data staging tasks.
2. The value specified for DSTG_MAX_STARTERS will be the number of initiators available for the built-in data_stage class on that machine.
3. The value specified for MAX_STARTERS will not limit the value specified for DSTG_MAX_STARTERS.
Default value: 0

DSTG_MIN_SCHEDULING_INTERVAL
Specifies a minimum interval between scheduling inbound data staging job steps when they cannot be scheduled immediately. With a workload that involves many data staging jobs, this keyword can be adjusted down from the default value of 900 seconds if data staging jobs remain idle when there are data staging resources available. Setting this keyword to a smaller interval may impact scheduler performance when there is contention for data staging resources and a large number of idle jobs in the queue.
Syntax:
   DSTG_MIN_SCHEDULING_INTERVAL = seconds
Notes:
1. You can only specify this keyword in the global configuration file; it will be ignored in local configuration files.
2. LoadLeveler ignores DSTG_MIN_SCHEDULING_INTERVAL when DSTG_TIME = AT_SUBMIT.
Default value: 900 seconds

DSTG_TIME
Specifies one of the following:
AT_SUBMIT
   LoadLeveler can schedule data staging steps any time after a job requiring data staging has been submitted.
JUST_IN_TIME
   LoadLeveler must schedule data staging job steps as close as possible to the application job steps that were submitted in the same job.
Syntax:
   DSTG_TIME = AT_SUBMIT | JUST_IN_TIME
Note: You can only specify the DSTG_TIME keyword in the global configuration file. Any value specified for this keyword in local configuration files will be ignored.
Default value: AT_SUBMIT
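A cluster that stages data immediately before the application runs might use settings like these; the values are illustrative:
   DSTG_TIME = JUST_IN_TIME
   DSTG_MAX_STARTERS = 2
   DSTG_MIN_SCHEDULING_INTERVAL = 300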
ENFORCE_RESOURCE_MEMORY
Specifies whether the AIX Workload Manager is configured to limit, as precisely as possible, the real memory usage of a WLM class. For this keyword to be valid, ConsumableMemory must be set through the ENFORCE_RESOURCE_USAGE keyword.
Syntax:
   ENFORCE_RESOURCE_MEMORY = true | false
Default value: false

ENFORCE_RESOURCE_POLICY
Specifies what type of resource entitlements will be assigned to the AIX Workload Manager classes:
v shares means a share value is assigned to the class based on the job step's requested resources (one unit of resource equals one share). This is the default policy.
v soft means a percentage value is assigned to the class based on the job step's requested resources and the total machine resources. This percentage can be exceeded if there is no contention for the resource.
v hard means a percentage value is assigned to the class based on the job step's requested resources and the total machine resources. This percentage cannot be exceeded regardless of the contention for the resource.
This keyword is only valid for CPU and real memory with either shares or percent limits. If desired, this keyword can be used in the LoadL_config.local file to set up a different policy for each machine. The ENFORCE_RESOURCE_USAGE keyword must be set for this keyword to be valid.
Syntax:
   ENFORCE_RESOURCE_POLICY = hard | soft | shares
Default value: shares

ENFORCE_RESOURCE_SUBMISSION
Indicates whether jobs submitted should be checked for the resources and node_resources keywords. If the value specified is true, LoadLeveler will check all jobs at submission time for the resources and node_resources keywords. The job command file resources and node_resources keywords combined need to specify at least the resources named in the ENFORCE_RESOURCE_USAGE keyword in order for the job to be submitted successfully. When RSET_MCM_AFFINITY is enabled, the task_affinity or parallel_threads keyword can be used instead of the resources and node_resources keywords when the resource being enforced is ConsumableCpus. If the value specified is false, no checking will be done and jobs submitted without the resources or node_resources keywords will not have resources enforced. In this instance, those jobs might interfere with other jobs whose resources are enforced.
Syntax:
   ENFORCE_RESOURCE_SUBMISSION = true | false
Default value: false

ENFORCE_RESOURCE_USAGE
Specifies whether the AIX Workload Manager is used to enforce CPU and memory resources. This keyword accepts either a value of deactivate or a list of one or more of the following predefined resources:
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
Either memory or CPUs or both can be enforced, but the resources must also be specified on the SCHEDULE_BY_RESOURCES keyword. If deactivate is specified, LoadLeveler will deactivate AIX Workload Manager on all the nodes in the LoadLeveler cluster.
Restriction: WLM enforcement is ignored by LoadLeveler for Linux.
Syntax:
   ENFORCE_RESOURCE_USAGE = name name ... name | deactivate
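To enforce CPU and memory with WLM, the related keywords are typically set together; the combination below is an illustrative sketch (SCHEDULE_BY_RESOURCES has its own entry later in this alphabetical list):
   SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_USAGE = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_SUBMISSION = true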
EXECUTE
Specifies the local directory to store the executables of jobs submitted by other machines.
Syntax:
   EXECUTE = local directory/execute
Default value: $(tilde)/execute

FAIR_SHARE_INTERVAL
Specifies, in units of hours, the time interval it takes for resource usage in fair share scheduling to decay to 5% of its initial value. Historic fair share data collected before the most recent time interval of this length will have little impact on fair share scheduling.
Syntax:
   FAIR_SHARE_INTERVAL = hours
Default value: The default value is 168 hours (one week). If a negative value or 0 is specified, the default value is used.

FAIR_SHARE_TOTAL_SHARES
Specifies the total number of shares that the cluster CPU or Blue Gene resources are divided into. If this value is less than or equal to 0, fair share scheduling is turned off.
Syntax:
   FAIR_SHARE_TOTAL_SHARES = shares
Default value: The default value is 0.
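For instance, to turn fair share scheduling on with 100 total shares and a two-week usage decay interval (both values are illustrative):
   FAIR_SHARE_TOTAL_SHARES = 100
   FAIR_SHARE_INTERVAL = 336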
FEATURE
Specifies an optional characteristic to use to match jobs with machines. You can specify unique characteristics for any machine using this keyword. When evaluating job submissions, LoadLeveler compares any required features specified in the job command file to those specified using this keyword. You can have a maximum of 1024 characters in the feature statement.
Syntax:
   Feature = {"string" ...}
Default value: No default value is set.
Example: If a machine has licenses for installed products ABC and XYZ, in the local configuration file you can enter the following:
   Feature = {"abc" "xyz"}
When submitting a job that requires both of these products, you should enter the following in your job command file:
   requirements = (Feature == "abc") && (Feature == "xyz")
Note: You must define a feature on all machines that will be able to run dynamic simultaneous multithreading (SMT). SMT is only supported on POWER6 and POWER5 processor-based systems.
Example: When submitting a job that requires the SMT function, first specify smt = yes in the job command file (or select a class which has smt = yes defined). Next, specify node_usage = not_shared and, last, enter the following in the job command file:
   requirements = (Feature == "smt")

FLOATING_RESOURCES
Specifies which consumable resources are available collectively on all of the machines in the LoadLeveler cluster. The count for each resource must be an integer greater than or equal to zero, and each resource can only be specified once in the list. Any resource specified for this keyword that is not already listed in the SCHEDULE_BY_RESOURCES keyword will not affect job scheduling. If any resource is specified incorrectly with the FLOATING_RESOURCES keyword, then all floating resources will be ignored. ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory may not be specified as floating resources.
Syntax:
   FLOATING_RESOURCES = name(count) name(count) ... name(count)
Default value: No default value is set.
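Example: A cluster-wide pool of 20 licenses for a product could be modeled as a floating resource; the resource name below is hypothetical, and the resource must also appear on the SCHEDULE_BY_RESOURCES keyword for it to affect scheduling:
   FLOATING_RESOURCES = abc_license(20)
   SCHEDULE_BY_RESOURCES = abc_license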
FS_INTERVAL
Defines the number of minutes used as the interval for checking free file system space or inodes. If your file system receives many log messages or copies large executables to the LoadLeveler spool, the file system will fill up more quickly and you should perform file size checking more frequently by setting the interval to a smaller value. LoadLeveler will not check the file system if the value of FS_INTERVAL is:
v Set to zero
v Set to a negative integer
Syntax:
   FS_INTERVAL = minutes
Default value: If FS_INTERVAL is not specified but any of the other file-system keywords (FS_NOTIFY, FS_SUSPEND, FS_TERMINATE, INODE_NOTIFY, INODE_SUSPEND, INODE_TERMINATE) are specified, the FS_INTERVAL value will default to 5 and the file system will be checked. If no file-system or inode keywords are set, LoadLeveler does not monitor file systems at all.
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_NOTIFY
Defines the lower and upper amounts, in bytes, of free file-system space at which LoadLeveler is to notify the administrator:
v If the amount of free space becomes less than the lower threshold value, LoadLeveler sends a mail message to the administrator indicating that logging problems may occur.
v When the amount of free space becomes greater than the upper threshold value, LoadLeveler sends a mail message to the administrator indicating that the problem has been resolved.
Syntax:
   FS_NOTIFY = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for both the lower and upper thresholds are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: 1KB, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_SUSPEND
Defines the lower and upper amounts, in bytes, of free file system space at which LoadLeveler drains and resumes the Schedd and startd daemons running on a node.
v If the amount of free space becomes less than the lower threshold value, LoadLeveler drains the Schedd and the startd daemons if they are running on the node. When this happens, logging is turned off and mail notification is sent to the administrator.
v When the amount of free space becomes greater than the upper threshold value, LoadLeveler signals the Schedd and the startd daemons to resume. When this happens, logging is turned on and mail notification is sent to the administrator.
Syntax:
   FS_SUSPEND = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for both the lower and upper thresholds are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: -1B, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_TERMINATE
Defines the lower and upper amounts, in bytes, of free file system space at which LoadLeveler is terminated. This keyword sends the SIGTERM signal to the Master daemon, which then terminates all LoadLeveler daemons running on the node.
v If the amount of free space becomes less than the lower threshold value, all LoadLeveler daemons are terminated.
v An upper threshold value is required for this keyword. However, since LoadLeveler has been terminated at the lower threshold, no action occurs.
Syntax:
   FS_TERMINATE = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for the lower threshold are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: -1B, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.
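As a sketch, a site might check every 5 minutes, warn the administrator at 100 MB of free space, drain the daemons at 50 MB, and terminate LoadLeveler at 10 MB; all thresholds below are illustrative, and each upper value controls when the corresponding condition is considered resolved:
   FS_INTERVAL = 5
   FS_NOTIFY = 100MB, 200MB
   FS_SUSPEND = 50MB, 100MB
   FS_TERMINATE = 10MB, 20MB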
GLOBAL_HISTORY
Identifies the directory that will contain the global history files produced by the llacctmrg command when no directory is specified as a command argument.
Syntax:
   GLOBAL_HISTORY = directory
Default value: The default value is $(SPOOL) (the local spool directory).
For more information related to using this keyword, see "Collecting the accounting information and storing it into files" on page 66.

GSMONITOR
Location of the gsmonitor executable (LoadL_GSmonitor).
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR = directory
Default value: $(BIN)/LoadL_GSmonitor

GSMONITOR_COREDUMP_DIR
Local directory for storing LoadL_GSmonitor core dump files.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

GSMONITOR_DOMAIN
Specifies the peer domain on which the GSMONITOR daemon will execute.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_DOMAIN = PEER
Default value: No default value is set.
For more information related to using this keyword, see "The gsmonitor daemon" on page 14.

GSMONITOR_RUNS_HERE
Specifies whether the gsmonitor daemon will run on the host.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_RUNS_HERE = TRUE | FALSE
Default value: FALSE
For more information related to using this keyword, see "The gsmonitor daemon" on page 14.

HISTORY
Defines the path name where a file containing the history of local LoadLeveler jobs is kept.
Syntax:
   HISTORY = directory
Default value: $(SPOOL)/history
For more information related to using this keyword, see "Collecting the accounting information and storing it into files" on page 66.
HISTORY_PERMISSION
Specifies the owner, group, and world permissions of the history file associated with a LoadL_schedd daemon.
Syntax:
   HISTORY_PERMISSION = permissions | rw-rw----
permissions must be a string with a length of nine characters consisting of the characters r, w, x, or -.
Default value: The default settings are 660 (rw-rw----). LoadL_schedd will use the default setting if the specified permissions are less than rw-------.
Example: A specification such as HISTORY_PERMISSION = rw-rw-r-- will result in permission settings of 664.

INODE_NOTIFY
Defines the lower and upper amounts, in inodes, of free file-system inodes at which LoadLeveler is to notify the administrator:
v If the number of free inodes becomes less than the lower threshold value, LoadLeveler sends a mail message to the administrator indicating that logging problems may occur.
v When the number of free inodes becomes greater than the upper threshold value, LoadLeveler sends a mail message to the administrator indicating that the problem has been resolved.
Syntax:
   INODE_NOTIFY = lower threshold, upper threshold
The valid values for both the lower and upper thresholds are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: 1000, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

INODE_SUSPEND
Defines the lower and upper amounts, in inodes, of free file system inodes at which LoadLeveler drains and resumes the Schedd and startd daemons running on a node.
v If the number of free inodes becomes less than the lower threshold value, LoadLeveler drains the Schedd and the startd daemons if they are running on the node. When this happens, logging is turned off and mail notification is sent to the administrator.
v When the number of free inodes becomes greater than the upper threshold value, LoadLeveler signals the Schedd and the startd daemons to resume. When this happens, logging is turned on and mail notification is sent to the administrator.
Syntax:
   INODE_SUSPEND = lower threshold, upper threshold
The valid values for both the lower and upper thresholds are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: -1, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.
INODE_TERMINATE
Defines the lower and upper amounts, in inodes, of free file system inodes at which LoadLeveler is terminated. This keyword sends the SIGTERM signal to the Master daemon, which then terminates all LoadLeveler daemons running on the node.
v If the number of free inodes becomes less than the lower threshold value, all LoadLeveler daemons are terminated.
v An upper threshold value is required for this keyword. However, since LoadLeveler has been terminated at the lower threshold, no action occurs.
Syntax:
   INODE_TERMINATE = lower threshold, upper threshold
The valid values for the lower threshold are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: -1, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

JOB_ACCT_Q_POLICY
Specifies the amount of time, in seconds, that determines how often the startd daemon updates the Schedd daemon with accounting data of running jobs. This controls the accuracy of the llq -x command.
Syntax:
   JOB_ACCT_Q_POLICY = number
Default value: 300 seconds
For more information related to using this keyword, see "Gathering job accounting data" on page 61.

JOB_EPILOG
Path name of the epilog program.
Syntax:
   JOB_EPILOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

JOB_LIMIT_POLICY
Specifies the interval, in seconds, at which LoadLeveler checks whether job_cpu_limit has been exceeded. The smaller of JOB_LIMIT_POLICY and JOB_ACCT_Q_POLICY is used to control how often the startd daemon collects resource consumption data on running jobs, and how often the job_cpu_limit is checked.
Syntax:
   JOB_LIMIT_POLICY = number
Default value: The default for JOB_LIMIT_POLICY is POLLING_FREQUENCY multiplied by POLLS_PER_UPDATE.

JOB_PROLOG
Path name of the prolog program.
Syntax:
   JOB_PROLOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.
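For example, a site-written prolog and epilog pair could be configured as follows; both paths are hypothetical:
   JOB_PROLOG = /u/loadl/bin/job_prolog
   JOB_EPILOG = /u/loadl/bin/job_epilog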
JOB_USER_EPILOG
Path name of the user epilog program.
Syntax:
   JOB_USER_EPILOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

JOB_USER_PROLOG
Path name of the user prolog program.
Syntax:
   JOB_USER_PROLOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

KBDD
Location of the kbdd executable (LoadL_kbdd).
Syntax:
   KBDD = directory
Default value: $(BIN)/LoadL_kbdd

KBDD_COREDUMP_DIR
Local directory for storing LoadL_kbdd daemon core dump files.
Syntax:
   KBDD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

KILL
Determines whether or not vacated jobs should be sent the SIGKILL signal and replaced in the queue. It is used to remove a job that is taking too long to vacate.
Syntax:
   KILL: expression that evaluates to T or F (true or false)
When T, vacated LoadLeveler jobs are removed from the machine with no attempt to take checkpoints.
For information about time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.

LIB
Defines the directory where LoadLeveler libraries are kept.
Syntax:
   LIB = directory
Default value: $(RELEASEDIR)/lib
LL_RSH_COMMAND
Specifies an administrator-provided executable to be used by llctl start when starting LoadLeveler on remote machines in the administration file. The LL_RSH_COMMAND keyword is any executable that can be used as a substitute for /usr/bin/rsh. The llctl start command passes arguments to the executable specified by LL_RSH_COMMAND in the following format:
   LL_RSH_COMMAND hostname -n llctl start options
Syntax:
   LL_RSH_COMMAND = full_path_to_executable
This keyword must specify the full path name of the executable provided. If no value is specified, LoadLeveler will use /usr/bin/rsh as the default when issuing a start. If an error occurs while locating the specified executable, an error message is displayed.
Default value: /usr/bin/rsh
Example: This example shows that using the secure shell (/usr/bin/ssh) is the preferred method for the llctl start command to communicate with remote nodes. Specify the following in the configuration file:
   LL_RSH_COMMAND=/usr/bin/ssh

LOADL_ADMIN
Specifies a list of LoadLeveler administrators.
Syntax:
   LOADL_ADMIN = list of user names
Where list of user names is a blank-delimited list of those individuals who will have administrative authority. These users are able to invoke the administrator-only commands such as llctl, llfavorjob, and llfavoruser. These administrators can also invoke the administrator-only GUI functions. For more information, see Chapter 7, "Using LoadLeveler's GUI to perform administrator tasks," on page 169.
Default value: No default value is set, which means no one has administrator authority until this keyword is defined with one or more user names.
Example: To grant administrative authority to users bob and mary, enter the following in the configuration file:
   LOADL_ADMIN = bob mary
For more information related to using this keyword, see "Defining LoadLeveler administrators" on page 43.

LOCAL_CONFIG
Specifies the path name of the optional local configuration file containing information specific to a node in the LoadLeveler network.
Syntax:
   LOCAL_CONFIG = directory
Default value: No default value is set.
Examples:
v If you are using a distributed file system like NFS, some examples are:
   LOCAL_CONFIG = $(tilde)/$(host).LoadL_config.local
   LOCAL_CONFIG = $(tilde)/LoadL_config.$(host).$(domain)
   LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)
  See "LoadLeveler variables" on page 314 for information about the tilde, host, and domain variables.
v If you are using a local file system, an example is:
   LOCAL_CONFIG = /var/LoadL/LoadL_config.local

LOG
Defines the local directory to store log files. It is not necessary to keep all the log files created by the various LoadLeveler daemons and programs in one directory, but you will probably find it convenient to do so.
Syntax:
   LOG = local directory/log
Default value: $(tilde)/log

LOG_MESSAGE_THRESHOLD
Specifies the maximum amount of memory, in bytes, for the message queue. Messages in the queue are waiting to be written to the log file. When the message logging thread cannot write messages to the log file as fast as they arrive, the memory consumed by the message queue can exceed the threshold. In this case, LoadLeveler will curtail logging by turning off all debug flags except D_ALWAYS, thereby reducing the amount of logging that takes place. If the threshold is exceeded by the curtailed message queue, message logging is stopped. Special log messages are written to the log file indicating that some messages are missing. Mail is also sent to the administrator indicating that messages are missing. A value of -1 for this keyword will turn off the buffer threshold, meaning that the threshold is unlimited.
Syntax:
   LOG_MESSAGE_THRESHOLD = bytes
Default value: 20*1024*1024 (bytes)

MACHINE_AUTHENTICATE
Specifies whether machine validation is performed. When set to true, LoadLeveler only accepts connections from machines specified in the administration file. When set to false, LoadLeveler accepts connections from any machine.
When set to true, every communication between LoadLeveler processes will verify that the sending process is running on a machine which is identified via a machine stanza in the administration file. The validation is done by capturing the address of the sending machine when the accept function call is issued to accept a connection. The gethostbyaddr function is called to translate the address to a name, and the name is matched with the list derived from the administration file.
Note: You must not set the MACHINE_AUTHENTICATE keyword to true for a cluster which is configured to be a main scale-across cluster. The main scale-across cluster must permit communication with LoadLeveler daemons running on any machine in any cluster participating in the scale-across multicluster environment.
Syntax:
   MACHINE_AUTHENTICATE = true | false
Default value: false
For more information related to using this keyword, see "Defining a LoadLeveler cluster" on page 44.
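Example: To accept connections only from machines that have stanzas in the administration file, and to double the default message-queue threshold (an illustrative tuning choice, 40*1024*1024 bytes):
   MACHINE_AUTHENTICATE = true
   LOG_MESSAGE_THRESHOLD = 41943040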
MACHINE_UPDATE_INTERVAL
Specifies the time, in seconds, during which machines must report to the central manager.
Syntax:
   MACHINE_UPDATE_INTERVAL = number
Where number specifies the time period, in seconds, during which machines must report to the central manager. Machines that do not report in this number of seconds are considered down. number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is 300 seconds.
For more information related to using this keyword, see "Setting negotiator characteristics and policies" on page 45.

MACHPRIO
Machine priority expression.
Syntax:
   MACHPRIO = expression
You can use the following LoadLeveler variables in the MACHPRIO expression:
v LoadAvg
v Connectivity
v Cpus
v Speed
v Memory
v VirtualMemory
v Disk
v CustomMetric
v MasterMachPriority
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
v PagesFreed
v PagesScanned
v FreeRealMemory
For detailed descriptions of these variables, see "LoadLeveler variables" on page 314.
Default value: (0 - LoadAvg)
Examples:
v Example 1
  This example orders machines by the Berkeley one-minute load average.
     MACHPRIO : 0 - (LoadAvg)
  Therefore, if LoadAvg equals .7, this example would read:
     MACHPRIO : 0 - (.7)
  The MACHPRIO would evaluate to -.7.
v Example 2
  This example orders machines by the Berkeley one-minute load average normalized for machine speed:
     MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))
  Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this example would read:
     MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))
  This example further evaluates to:
     MACHPRIO : 0 - (350)
  The MACHPRIO would evaluate to -350. Notice that if the speed of the machine were increased to 3, the equation would read:
     MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))
  The MACHPRIO would evaluate to approximately -233. Therefore, as the speed of the machine increases, the MACHPRIO also increases.
v Example 3
  This example orders machines accounting for real memory and available swap space (remembering that Memory is in Mbytes and VirtualMemory is in Kbytes):
     MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) + (10 * Memory) + (VirtualMemory / 1000)
v Example 4
  This example sets a relative machine priority based on the value of the CUSTOM_METRIC keyword.
     MACHPRIO : CustomMetric
  To do this, you must specify a value for the CUSTOM_METRIC keyword or the CUSTOM_METRIC_COMMAND keyword in either the LoadL_config.local file of a machine or in the global LoadL_config file. To assign the same relative priority to all machines, specify the CUSTOM_METRIC keyword in the global configuration file. For example:
     CUSTOM_METRIC = 5
  You can override this value for an individual machine by specifying a different value in that machine's LoadL_config.local file.
v Example 5
  This example gives master nodes the highest priority:
     MACHPRIO : (MasterMachPriority * 10000)
v Example 6
  This example gives the highest priority to nodes with the highest percentage of switch adapters with connectivity:
     MACHPRIO : Connectivity
For more information related to using this keyword, see "Setting negotiator characteristics and policies" on page 45.

MAIL
Name of a local mail program used to override default mail notification.
Syntax:
   MAIL = program name
Default value: No default value is set.
For more information related to using this keyword, see "Using your own mail program" on page 81.
MASTER
Location of the master executable (LoadL_master).
Syntax:
   MASTER = directory
Default value: $(BIN)/LoadL_master
For more information related to using this keyword, see "How LoadLeveler daemons process jobs" on page 8.

MASTER_COREDUMP_DIR
Local directory for storing LoadL_master core dump files.
Syntax:
   MASTER_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

MASTER_DGRAM_PORT
The port number used when connecting to the daemon.
Syntax:
   MASTER_DGRAM_PORT = port number
Default value: The default is 9617.
For more information related to using this keyword, see "Defining network characteristics" on page 47.

MASTER_STREAM_PORT
Specifies the port number to be used when connecting to the daemon.
Syntax:
   MASTER_STREAM_PORT = port number
Default value: The default is 9616.
For more information related to using this keyword, see "Defining network characteristics" on page 47.

MAX_CKPT_INTERVAL
The maximum number of seconds between checkpoints for running jobs.
Syntax:
   MAX_CKPT_INTERVAL = number
Default value: 7200 (2 hours)
For more information related to using this keyword, see "LoadLeveler support for checkpointing jobs" on page 139.
MAX_JOB_REJECT
Determines the number of times a job is rejected before it is canceled or put in User Hold or System Hold status.
Syntax:
   MAX_JOB_REJECT = number
number must be a numerical value and cannot be an arithmetic expression. MAX_JOB_REJECT may be set to unlimited rejects by specifying a value of -1.
Default value: The default value is 0, which indicates a rejected job will immediately be canceled or placed on hold.
For related information, see the NEGOTIATOR_REJECT_DEFER keyword.

MAX_RESERVATIONS
Specifies the maximum number of reservations that this LoadLeveler cluster can have. Only reservations in waiting and in use are counted toward this limit; LoadLeveler does not count reservations that have already ended or are in the process of being canceled.
Notes:
1. Having too many reservations in a LoadLeveler cluster can have performance impacts. Administrators should select a suitable value for this keyword.
2. A recurring reservation only counts as one reservation toward the MAX_RESERVATIONS limit, regardless of the number of times that the reservation recurs.
Syntax:
   MAX_RESERVATIONS = number
The value for this keyword can be 0 or a positive integer.
Default value: The default is 10.

MAX_STARTERS
Specifies the maximum number of tasks that can run simultaneously on a machine. In this case, a task can be a serial job step or a parallel task. MAX_STARTERS defines the number of initiators on the machine (the number of tasks that can be initiated from a startd).
Syntax:
   MAX_STARTERS = number
Default value: If this keyword is not specified, the default is the number of elements in the Class statement.
For more information related to using this keyword, see "Specifying how many jobs a machine can run" on page 55.
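Example: With the illustrative statements below, the machine advertises four class slots but runs at most three tasks at the same time, whatever mix of classes they come from:
   CLASS = small(3) large(1)
   MAX_STARTERS = 3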
MAX_TOP_DOGS
Specifies the maximum total number of top dogs that the central manager daemon will allocate. When scheduling jobs, after MAX_TOP_DOGS total top dogs have been allocated, no more will be considered.
Syntax:
   MAX_TOP_DOGS = k
where k is a non-negative integer specifying the global maximum top dogs limit.
Default value: The default value is 1.
For more information related to using this keyword, see "Using the BACKFILL scheduler" on page 110.

MIN_CKPT_INTERVAL
The minimum number of seconds between checkpoints for running jobs.
Syntax:
   MIN_CKPT_INTERVAL = number
Default value: 900 (15 minutes)
For more information related to using this keyword, see "LoadLeveler support for checkpointing jobs" on page 139.

NEGOTIATOR
Location of the negotiator executable (LoadL_negotiator).
Syntax:
   NEGOTIATOR = directory
Default value: $(BIN)/LoadL_negotiator
For more information related to using this keyword, see "How LoadLeveler daemons process jobs" on page 8.

NEGOTIATOR_COREDUMP_DIR
Local directory for storing LoadL_negotiator core dump files.
Syntax:
   NEGOTIATOR_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

NEGOTIATOR_CYCLE_DELAY
Specifies the minimum time, in seconds, the negotiator delays between periods when it attempts to schedule jobs. This time is used by the negotiator daemon to respond to queries, reorder job queues, collect information about changes in the states of jobs, and so on. Delaying the scheduling of jobs might improve the overall performance of the negotiator by preventing it from spending excessive time attempting to schedule jobs.
Syntax:
   NEGOTIATOR_CYCLE_DELAY = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is 0 seconds.

NEGOTIATOR_CYCLE_TIME_LIMIT
Specifies the maximum amount of time, in seconds, that LoadLeveler will allow the negotiator to spend in one cycle trying to schedule jobs. The negotiator cycle will end after the specified number of seconds even if there are additional jobs waiting for dispatch. Jobs waiting for dispatch will be considered at the next negotiator cycle. The NEGOTIATOR_CYCLE_TIME_LIMIT keyword applies only to the BACKFILL scheduler.
Syntax:
   NEGOTIATOR_CYCLE_TIME_LIMIT = number
Where number must be a positive integer or zero and cannot be an arithmetic expression.
Default value: If the keyword value is not specified or a value of zero is used, the negotiator cycle will be unlimited.
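In a busy cluster, for example, an administrator might bound each scheduling cycle and add a small delay between cycles; the values below are illustrative tuning choices:
   NEGOTIATOR_CYCLE_DELAY = 10
   NEGOTIATOR_CYCLE_TIME_LIMIT = 120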
NEGOTIATOR_INTERVAL
The time interval, in seconds, at which the negotiator daemon updates the status of jobs in the LoadLeveler cluster and negotiates with machines that are available to run jobs.
Syntax:
   NEGOTIATOR_INTERVAL = number
Where number specifies the interval, in seconds, at which the negotiator daemon performs a "negotiation loop" during which it attempts to assign available machines to waiting jobs. A negotiation loop also occurs whenever job states or machine states change. number must be a numerical value and cannot be an arithmetic expression.
When this keyword is set to zero, the central manager's automatic scheduling activity is disabled, and LoadLeveler will not attempt to schedule any jobs unless instructed to do so through the llrunscheduler command or ll_run_scheduler subroutine.
Default value: The default is 30 seconds.
For more information related to using this keyword, see "Controlling the central manager scheduling cycle" on page 73.

NEGOTIATOR_LOADAVG_INCREMENT
Specifies the value the negotiator adds to the startd machine's load average whenever a job in the Pending state is queued on that machine. This value is used to compensate for the increased load caused by starting another job.
Syntax:
   NEGOTIATOR_LOADAVG_INCREMENT = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default value is .5

NEGOTIATOR_PARALLEL_DEFER
Specifies the amount of time, in seconds, that defines how long a job stays out of the queue after it fails to get the correct number of processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax:
   NEGOTIATOR_PARALLEL_DEFER = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.

NEGOTIATOR_PARALLEL_HOLD
Specifies the amount of time, in seconds, that defines how long a job is given to accumulate processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax:
   NEGOTIATOR_PARALLEL_HOLD = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.
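Example: Doubling the negotiation interval reduces scheduling overhead on a lightly loaded cluster. With the illustrative setting below, the two parallel keywords above would default to 300 seconds (5 times the interval):
   NEGOTIATOR_INTERVAL = 60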
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
Specifies the amount of time, in seconds, between calculations of the SYSPRIO values for waiting jobs. Recalculating the priority can be CPU-intensive; specifying low values for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword may lead to a heavy CPU load on the negotiator if a large number of jobs are running or waiting for resources. A value of 0 means the SYSPRIO values are not recalculated. You can use this keyword to base the order in which jobs are run on the current number of running, queued, or total jobs for a user or a group.
Syntax: NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 120 seconds.

NEGOTIATOR_REJECT_DEFER
Specifies the amount of time, in seconds, the negotiator waits before it considers scheduling a job to a machine that recently rejected the job.
Syntax: NEGOTIATOR_REJECT_DEFER = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 120 seconds.
For related information, see the MAX_JOB_REJECT keyword.

NEGOTIATOR_REMOVE_COMPLETED
Specifies the amount of time, in seconds, that you want the negotiator to keep information about completed and removed jobs so that you can query this information using the llq command.
Syntax: NEGOTIATOR_REMOVE_COMPLETED = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 0 seconds.

NEGOTIATOR_RESCAN_QUEUE
Specifies the amount of time, in seconds, that the negotiator waits before rescanning the job queue for machines that have bypassed jobs that could not run because of conditions that may change over time. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax: NEGOTIATOR_RESCAN_QUEUE = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 900 seconds.
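For instance (values hypothetical), an administrator who wants completed and removed jobs to remain visible to llq for ten minutes, while recalculating SYSPRIO values less often, might set:

NEGOTIATOR_REMOVE_COMPLETED = 600
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300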
NEGOTIATOR_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: NEGOTIATOR_STREAM_PORT = port number
Default value: 9614.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

OBITUARY_LOG_LENGTH
Specifies the number of lines from the end of the log file that are appended to the mail message that the master daemon sends to the LoadLeveler administrators when one of the daemons dies.
Syntax: OBITUARY_LOG_LENGTH = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 25.

POLLING_FREQUENCY
Specifies the interval, in seconds, at which the startd daemon evaluates the load on the local machine and decides whether to suspend, resume, or abort jobs. This is also the minimum interval at which the kbdd daemon reports keyboard or mouse activity to the startd daemon.
Syntax: POLLING_FREQUENCY = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 5.

POLLS_PER_UPDATE
Specifies how often, in POLLING_FREQUENCY intervals, the startd daemon updates the central manager. Because of the communication overhead, it is impractical to update the central manager at the frequency defined by the POLLING_FREQUENCY keyword. Therefore, the startd daemon updates the central manager only every nth (where n is the number specified for POLLS_PER_UPDATE) local update. Change POLLS_PER_UPDATE when changing POLLING_FREQUENCY.
Syntax: POLLS_PER_UPDATE = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 24.
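A quick worked example of how these two keywords combine (the defaults first, the alternative pair hypothetical): with POLLING_FREQUENCY = 5 and POLLS_PER_UPDATE = 24, the startd daemon evaluates the local load every 5 seconds but updates the central manager only every 5 * 24 = 120 seconds. The pair

POLLING_FREQUENCY = 10
POLLS_PER_UPDATE = 12

keeps the same 120-second update interval while evaluating the local machine half as often.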
PRESTARTED_STARTERS
Specifies how many prestarted starter processes LoadLeveler maintains on an execution node to manage jobs when they arrive. The startd daemon starts the number of starter processes specified by this keyword. You may specify this keyword in either the global or local configuration file.
Syntax: PRESTARTED_STARTERS = number
number must be less than or equal to the value specified through the MAX_STARTERS keyword. If the value of PRESTARTED_STARTERS is greater than MAX_STARTERS, LoadLeveler records a warning message in the startd log and assigns PRESTARTED_STARTERS the same value as MAX_STARTERS. If the value of PRESTARTED_STARTERS is zero, no starter processes are started before jobs arrive on the execution node.
Default value: 1.

PREEMPT_CLASS
Defines the preemption rule for a job class.
Syntax: The following forms illustrate correct syntax.
PREEMPT_CLASS[incoming_class] = ALL[:preempt_method] { outgoing_class1 [outgoing_class2 ...] }
Using this form, ALL indicates that job steps of incoming_class have priority and will not share nodes with job steps of outgoing_class1, outgoing_class2, or other outgoing classes. If a job step of the incoming_class is to be started on a set of nodes, all job steps of outgoing_class1, outgoing_class2, or other outgoing classes running on those nodes will be preempted.
Note: The ALL preemption rule does not apply to Blue Gene jobs.
PREEMPT_CLASS[incoming_class] = ENOUGH[:preempt_method] { outgoing_class1 [outgoing_class2 ...] }
Using this form, ENOUGH indicates that job steps of incoming_class will share nodes with job steps of outgoing_class1, outgoing_class2, or other outgoing classes if there are sufficient resources. If a job step of the incoming_class is to be started on a set of nodes, one or more job steps of outgoing_class1, outgoing_class2, or other outgoing classes running on those nodes may be preempted to get needed resources.
Combinations of these forms are also allowed.
Note:
1. The optional specification preempt_method indicates which method LoadLeveler is to use to preempt the jobs; this specification is valid only for the BACKFILL scheduler. Valid values for this specification in keyword syntax are the abbreviations shown in parentheses:
v Remove (rm)
v System hold (sh)
v Suspend (su)
v Vacate (vc)
v User hold (uh)
For more information about preemption methods, see “Steps for configuring a scheduler to preempt jobs” on page 130.
2. Using the ALL value in the PREEMPT_CLASS keyword places implied restrictions on when a job can start. See “Planning to preempt jobs” on page 128 for more information.
3. The incoming class is designated inside [ ] brackets.
4. Outgoing classes are designated inside { } curly braces.
5. The job classes on the right-hand (outgoing) side of the statement must be different from the incoming class, or the outgoing side may be allclasses. If the outgoing side is defined as allclasses, then all job classes are preemptable with the exception of the incoming class specified within brackets.
6. A class name or allclasses should not be in both the ALL list and the ENOUGH list. If it is, the entire statement will be ignored. An example of a statement that is ignored:
PREEMPT_CLASS[Class_A]=ALL{allclasses} ENOUGH {allclasses}
7. If you use allclasses as an outgoing (preemptable) class, then no other class names should be listed on the right-hand side, as the entire statement will be ignored. An example of a statement that is ignored:
PREEMPT_CLASS[Class_A]=ALL{Class_B} ENOUGH {allclasses}
8. More than one ALL statement and more than one ENOUGH statement may appear on the right-hand side. Multiple statements have a cumulative effect.
9. Each ALL or ENOUGH statement can have multiple class names inside the curly braces. However, a blank space delimiter is required between each class name.
10. Both the ALL and ENOUGH statements can include an optional specification indicating the method LoadLeveler will use to preempt the jobs. Valid values for this specification are listed in the description of the DEFAULT_PREEMPT_METHOD keyword. If a value is specified on the PREEMPT_CLASS ALL or ENOUGH statement, that value overrides the value set on the DEFAULT_PREEMPT_METHOD keyword, if any.
11. ALL and ENOUGH may be in mixed case.
12. Spaces are allowed around the brackets and curly braces.
13. PREEMPT_CLASS [allclasses] will be ignored.
Default value: No default value is set.
Examples:
PREEMPT_CLASS[Class_B]=ALL{Class_E Class_D} ENOUGH {Class_C}
This indicates that all Class_E jobs, all Class_D jobs, and enough Class_C jobs will be preempted to enable an incoming Class_B job to run.
PREEMPT_CLASS[Class_D]=ENOUGH:VC {Class_E}
This indicates that zero, one, or more Class_E jobs will be preempted using the vacate method to enable an incoming Class_D job to run.
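Because multiple ALL and ENOUGH statements are cumulative and each may carry its own preemption method (notes 8 and 10), a single rule can mix both forms; the class names below are hypothetical:

PREEMPT_CLASS[Class_A] = ALL:SU {Class_E} ENOUGH:VC {Class_C Class_D}

An incoming Class_A job step suspends all Class_E job steps on its nodes and vacates only as many Class_C and Class_D job steps as it needs.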
PREEMPTION_SUPPORT
For the BACKFILL or API schedulers only, specifies the level of preemption support for a cluster.
Syntax: PREEMPTION_SUPPORT = full | no_adapter | none
v When set to full, preemption is fully supported.
v When set to no_adapter, preemption is supported, but the adapter resources are not released by preemption.
v When set to none, preemption is not supported, and preemption requests will be rejected.
Note:
1. If the value of this keyword is set to any value other than none for the default scheduler, LoadLeveler will not start.
2. For the BACKFILL or API scheduler, when this keyword is set to full or no_adapter and preemption by the suspend method is required, the configuration keyword PROCESS_TRACKING must be set to true.
Default value: The default value for all schedulers is none; if you want to enable preemption under these schedulers, you must set a value for this keyword.

PROCESS_TRACKING
Specifies whether LoadLeveler cancels any processes (throughout the entire cluster) left behind when a job terminates.
Syntax: PROCESS_TRACKING = TRUE | FALSE
When set to TRUE, this keyword ensures that when a job is terminated, no processes created by the job continue running.
Note: This keyword must be set to true to allow preemption by the suspend method with the BACKFILL or API scheduler.
Default value: FALSE

PROCESS_TRACKING_EXTENSION
Specifies the directory containing the kernel module LoadL_pt_ke (AIX) or proctrk.ko (Linux).
Syntax: PROCESS_TRACKING_EXTENSION = directory
Default value: The directory $HOME/bin
For more information related to using this keyword, see “Tracking job processes” on page 70.

PUBLISH_OBITUARIES
Specifies whether the master daemon sends mail to the administrator when any daemon it manages ends abnormally. When set to true, this keyword specifies that the master daemon sends mail to the administrators identified by the LOADL_ADMIN keyword.
Syntax: PUBLISH_OBITUARIES = true | false
Default value: true
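Drawing these keywords together, a sketch of a configuration that allows suspend-method preemption under the BACKFILL scheduler (the PROCESS_TRACKING_EXTENSION directory shown is an assumed installation path; omit the line if the kernel module is in the default $HOME/bin):

SCHEDULER_TYPE = BACKFILL
PREEMPTION_SUPPORT = full
PROCESS_TRACKING = TRUE
PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin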
REJECT_ON_RESTRICTED_LOGIN
Specifies whether the user's account status will be checked on every node where the job will run by calling the AIX loginrestrictions function with the S_DIST_CLNT flag.
Restriction: Login restriction checking is ignored by LoadLeveler for Linux.
Login restriction checking includes:
v Does the account still exist?
v Is the account locked?
v Has the account expired?
v Do failed login attempts exceed the limit for this account?
v Is login disabled via /etc/nologin?
If the AIX loginrestrictions function indicates a failure, the user's job will be rejected and processed according to the LoadLeveler configuration parameters MAX_JOB_REJECT and ACTION_ON_MAX_REJECT.
Syntax: REJECT_ON_RESTRICTED_LOGIN = true | false
Default value: false

RELEASEDIR
Defines the directory where all the LoadLeveler software resides.
Syntax: RELEASEDIR = release directory
Default value: $(RELEASEDIR)

RESERVATION_CAN_BE_EXCEEDED
Specifies whether LoadLeveler will schedule job steps that are bound to a reservation when their end times (based on hard wall-clock limits) exceed the reservation end time.
Syntax: RESERVATION_CAN_BE_EXCEEDED = true | false
When this keyword is set to false, LoadLeveler schedules only those job steps that will complete before the reservation ends. When set to true, LoadLeveler schedules job steps to run under a reservation even if their end times are expected to exceed the reservation end time. When the reservation ends, however, the reserved nodes no longer belong to the reservation, so these nodes might not be available for the jobs to continue running. In this case, LoadLeveler might preempt the running jobs.
Note that this keyword setting does not change the actual end time of the reservation. It only affects how LoadLeveler manages job steps whose end times exceed the end time of the reservation.
Default value: true

RESERVATION_HISTORY
Defines the name of a file that is to contain the local history of reservations. LoadLeveler appends a single line to the reservation history file for each completed occurrence of each reservation. For an example, see “Collecting accounting data for reservations” on page 63.
Syntax: RESERVATION_HISTORY = file name
Default value: $(SPOOL)/reservation_history

RESERVATION_MIN_ADVANCE_TIME
Specifies the minimum time, in minutes, between the time at which a reservation is created and the time at which the reservation is to start. By default, the earliest time at which a reservation may start is the current time plus the value set for the RESERVATION_SETUP_TIME keyword.
Syntax: RESERVATION_MIN_ADVANCE_TIME = number of minutes
Default value: 0 (zero)
RESERVATION_PRIORITY
Specifies whether LoadLeveler administrators may reserve nodes on which running jobs are expected to end after the reservation start time. This keyword value applies only for LoadLeveler administrators; other reservation owners do not have this capability.
Syntax: RESERVATION_PRIORITY = NONE | HIGH
When you set this keyword to HIGH, before activating the reservation, LoadLeveler preempts the job steps running on the reserved nodes (Blue Gene job steps are handled the same way). The only exceptions are non-preemptable jobs; LoadLeveler will not preempt those jobs because of any reservation.
Default value: NONE

RESERVATION_SETUP_TIME
Specifies how much time, in seconds, LoadLeveler may use to prepare for a reservation before it is to start. The tasks that LoadLeveler performs during this time include checking and reporting node conditions, and preempting job steps still running on the reserved nodes. For a given reservation, LoadLeveler uses the RESERVATION_SETUP_TIME keyword value that is set at the time the reservation is created, not whatever value might be set when the reservation starts. If the start time of the reservation is modified, however, LoadLeveler uses the RESERVATION_SETUP_TIME keyword value that is set at the time of the modification.
Syntax: RESERVATION_SETUP_TIME = number of seconds
Default value: 60

RESTARTS_PER_HOUR
Specifies how many times the master daemon attempts to restart a daemon that dies abnormally. Because one or more of the daemons may be unable to run due to a permanent error, the master only attempts $(RESTARTS_PER_HOUR) restarts within a 60-minute period. Failing that, it sends mail to the administrators identified by the LOADL_ADMIN keyword and exits.
Syntax: RESTARTS_PER_HOUR = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 12.
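For example (values hypothetical), a site that requires reservations to be made at least 15 minutes in advance, allows two minutes of setup time, and lets administrator reservations preempt running jobs might set:

RESERVATION_MIN_ADVANCE_TIME = 15
RESERVATION_SETUP_TIME = 120
RESERVATION_PRIORITY = HIGH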
RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
Specifies whether a startd that was drained when the switch table failed to unload will automatically resume once the unload errors are cleared. The unload error is considered cleared after LoadLeveler can successfully unload the switch table. For this keyword to work, the DRAIN_ON_SWITCH_TABLE_ERROR option in the configuration file must be turned on and not disabled. Flushing, suspending, or draining a startd, whether manually or automatically, disables this option until the startd is manually resumed.
Syntax: RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true | false
Default value: false

RSET_SUPPORT
Indicates the level of RSet support present on a machine.
Syntax: RSET_SUPPORT = option
The available options are:
RSET_MCM_AFFINITY
Indicates that the machine can run jobs requesting MCM (memory or adapter) and processor (cache or SMT) affinity.
RSET_NONE
Indicates that LoadLeveler RSet support is not available on the machine.
RSET_USER_DEFINED
Indicates that the machine can be used for jobs with a user-created RSet in their job command file.
Default value: RSET_NONE

SAVELOGS
Specifies the directory in which log files are archived.
Syntax: SAVELOGS = directory
Where directory is the directory in which log files will be archived.
Default value: No default value is set.
For more information related to using this keyword, see “Configuring recording activity and log files” on page 48.

SAVELOGS_COMPRESS_PROGRAM
Compresses logs after they are copied to the SAVELOGS directory. If not specified, logs are copied to SAVELOGS but are not compressed.
Syntax: SAVELOGS_COMPRESS_PROGRAM = program
Where program is a specific executable program. It can be a system-provided facility (such as /bin/gzip) or an administrator-provided executable program. The value must be a full path name and can contain command-line arguments. LoadLeveler will call the program as: program filename.
Default value: If blank, the logs are not compressed.
Example: In the following example, LoadLeveler runs the gzip -f command, and each log file in SAVELOGS is compressed after it is copied there. If the program cannot be found or is not executable, LoadLeveler logs the error and SAVELOGS remains uncompressed.
SAVELOGS_COMPRESS_PROGRAM = /bin/gzip -f
SCALE_ACROSS_SCHEDULING_TIMEOUT
Defines the amount of time a central manager will wait:
v For the main cluster central manager, this value defines the wait time for responses from the non-main cluster central managers when it is scheduling scale-across jobs.
v For the non-main cluster central managers, this value limits how long the central manager on each non-main cluster will hold resources for a scale-across job step while waiting for an order to start the job.
Syntax: scale_across_scheduling_timeout = number
Default value: 300 seconds

SCHEDD
Location of the Schedd executable (LoadL_schedd).
Syntax: SCHEDD = directory
Default value: $(BIN)/LoadL_schedd
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

SCHEDD_COREDUMP_DIR
Specifies the local directory for storing LoadL_schedd core dump files.
Syntax: SCHEDD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

SCHEDD_INTERVAL
Specifies the interval, in seconds, at which the Schedd daemon checks the local job queue and updates the negotiator daemon.
Syntax: SCHEDD_INTERVAL = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 60 seconds.

SCHEDD_RUNS_HERE
Specifies whether the Schedd daemon runs on the host. If you do not want to run the Schedd daemon, specify false.
This keyword does not designate a machine as a public scheduling machine. Unless configured as a public scheduling machine, a machine configured to run the Schedd daemon accepts job submissions only from the same machine running the Schedd daemon. A public scheduling machine accepts job submissions from other machines in the LoadLeveler cluster. To configure a machine as a public scheduling machine, see the schedd_host keyword description in “Administration file keyword descriptions” on page 327.
Syntax: SCHEDD_RUNS_HERE = true | false
Default value: true
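As an illustration of the global and local configuration interplay (a hypothetical layout, assuming local configuration files may override this keyword in the same way they can override START_DAEMONS), a site that funnels submissions through dedicated scheduling nodes could leave the global default in place and add the following to the local configuration file of each compute-only node:

SCHEDD_RUNS_HERE = false

Submissions from those nodes would then be directed to the machines that still run Schedd and are configured as public scheduling machines through the schedd_host administration file keyword.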
SCHEDD_SUBMIT_AFFINITY
Specifies whether job submissions are directed to a locally running Schedd daemon. When the keyword is set to true, job submissions are directed to a Schedd daemon running on the same machine where the submission takes place, provided there is a Schedd daemon running on that machine. In this case the submission is said to have "affinity" for the local Schedd daemon. If there is no Schedd daemon running on the machine where the submission takes place, or if this keyword is set to false, the job submission will only be directed to a Schedd daemon serving as a public scheduling machine. In this case, if there are no public scheduling machines configured, the job cannot be submitted.
A public scheduling machine accepts job submissions from other machines in the LoadLeveler cluster. To configure a machine as a public scheduling machine, see the schedd_host keyword description in “Administration file keyword descriptions” on page 327.
Installations with a large number of nodes should consider setting this keyword to false to distribute the dispatching of jobs more evenly among the Schedd daemons. For more information, see “Scaling considerations” on page 719.
Syntax: SCHEDD_SUBMIT_AFFINITY = true | false
Default value: true

SCHEDD_STATUS_PORT
Specifies the port number used when connecting to the daemon.
Syntax: SCHEDD_STATUS_PORT = port number
Default value: 9606.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

SCHEDD_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: SCHEDD_STREAM_PORT = port number
Default value: 9605.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

SCHEDULE_BY_RESOURCES
Specifies which consumable resources are considered by the LoadLeveler schedulers. Each consumable resource name may be an administrator-defined alphanumeric string, or may be one of the following predefined resources:
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
v RDMA
Each string may appear in the list only once. These resources are either floating resources or machine resources. If any resource is specified incorrectly with the SCHEDULE_BY_RESOURCES keyword, then all scheduling resources will be ignored.
Syntax: SCHEDULE_BY_RESOURCES = name name ... name
Default value: No default value is set.
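For example, to have the schedulers consider CPUs, real memory, and a hypothetical administrator-defined floating resource named FloatingLicenses:

SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory FloatingLicenses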
SCHEDULER_TYPE
Specifies the LoadLeveler scheduling algorithm:
LL_DEFAULT
Specifies the default LoadLeveler scheduling algorithm. If SCHEDULER_TYPE has not been defined, LoadLeveler will use the default scheduler (LL_DEFAULT).
BACKFILL
Specifies the LoadLeveler BACKFILL scheduler. When you specify this keyword, you should use only the default settings for the START expression and the other job control expressions described in “Managing job status through control expressions” on page 68.
API
Specifies that you will use an external scheduler. External schedulers communicate with LoadLeveler through the job control API. For more information on setting up an external scheduler, see “Using an external scheduler” on page 115.
Syntax: SCHEDULER_TYPE = LL_DEFAULT | BACKFILL | API
Default value: LL_DEFAULT
Note:
1. If a scheduler type is not set, LoadLeveler will start, but it will use the default scheduler.
2. If you have set SCHEDULER_TYPE to an option that is not valid, LoadLeveler will not start.
3. If you change the scheduler option specified by SCHEDULER_TYPE, you must stop and restart, or recycle, LoadLeveler using llctl.
For more information related to using this keyword, see “Defining a LoadLeveler cluster” on page 44.
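A minimal sketch of switching a cluster to the BACKFILL scheduler (the llctl invocation shown is an assumption about the command form; see the llctl command documentation for the exact syntax your installation uses): add

SCHEDULER_TYPE = BACKFILL

to the global configuration file, then stop and restart or recycle LoadLeveler, for example:

llctl -g recycle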
SEC_ADMIN_GROUP
When security services are enabled, this keyword points to the name of the UNIX group that contains the local identities of the LoadLeveler administrators.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_ADMIN_GROUP = name of lladmin group
Default value: No default value is set.
For more information related to using this keyword, see “Configuring LoadLeveler to use cluster security services” on page 57.

SEC_ENABLEMENT
Specifies the security mechanism to be used.
Restriction: Do not set this keyword to CTSEC in the configuration file for a Linux machine. CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_ENABLEMENT = COMPAT | CTSEC
Default value: No default value is set.

SEC_SERVICES_GROUP
When security services are enabled, this keyword specifies the name of the LoadLeveler services group.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_SERVICES_GROUP = group name
Where group name defines the identities of the LoadLeveler daemons.
Default value: No default value is set.

SEC_IMPOSED_MECHS
Specifies a blank-delimited list of LoadLeveler's permitted security mechanisms when Cluster Security (CtSec) services are enabled.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: Specify a blank-delimited list containing combinations of the following values:
none
If this is the only value specified, then users will run unauthenticated and, if authorization is necessary, the job will fail. If this is not the only value specified, then users may run unauthenticated and, if authorization is necessary, the job will fail.
unix
If this is the only value specified, then UNIX host-based authentication will be used; otherwise, other mechanisms may be used.
Default value: No default value is set.
Example: SEC_IMPOSED_MECHS = none unix

SPOOL
Defines the local directory where LoadLeveler keeps the local job queue and checkpoint files.
Syntax: SPOOL = local directory/spool
Default value: $(tilde)/spool

START
Determines whether a machine can run a LoadLeveler job.
Syntax: START: expression that evaluates to T or F (true or false)
When the expression evaluates to T, LoadLeveler considers dispatching a job to the machine.
When you use a START expression that is based on the CPU load average, the negotiator may evaluate the expression as F even though the load average indicates the machine is Idle. This is because the negotiator adds a compensating factor to the startd machine's load average every time the negotiator assigns a job. For more information, see the NEGOTIATOR_INTERVAL keyword.
Default value: No default value is set, which means that no jobs will be started.
For information about time-related variables that you may use for this keyword, see “Variables to use for setting times” on page 320.
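Two illustrative START expressions; the first is the simplest possible form, and the second assumes the tm_hour and tm_wday time variables are among those described under “Variables to use for setting times” on page 320:

START : T

starts jobs unconditionally. A machine that should accept jobs only on weekends or outside business hours might instead use something like:

START : (tm_wday == 0) || (tm_wday == 6) || (tm_hour < 8) || (tm_hour >= 18)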
START_CLASS
Specifies the rule for starting a job of the incoming_class. The START_CLASS rule is applied whenever the BACKFILL scheduler decides whether a job step of the incoming_class should start.
Syntax: START_CLASS[incoming_class] = (start_class_expression) [ && (start_class_expression) ...]
Where start_class_expression takes the form:
run_class < number_of_tasks
which indicates that a job step of the incoming_class is only allowed to run on a node when the number of tasks of run_class running on that node is less than number_of_tasks.
Note:
1. START_CLASS [allclasses] will be ignored.
2. The job class specified by run_class may be the same as or different from the class specified by incoming_class.
3. You can also define run_class as allclasses. If you do, the total number of all job tasks running on that node cannot exceed the value specified by number_of_tasks.
4. A class name or allclasses should not appear twice on the right-hand side of the keyword statement. However, you can use other class names together with allclasses on the right-hand side of the statement.
5. If there is more than one start_class_expression, you must use && between adjacent start_class_expressions (see the combined example after this entry).
6. Both the START keyword and the START_CLASS keyword have to be true before a new job can start.
7. Parentheses ( ) are optional around start_class_expression.
For information related to using this keyword, see “Planning to preempt jobs” on page 128.
Default value: No default value is set.
Examples:
START_CLASS[Class_A] = (Class_A < 1)
This statement indicates that a Class_A job can only start on nodes that do not have any Class_A jobs running.
START_CLASS[Class_B] = allclasses < 5
This statement indicates that a Class_B job can only start on nodes with at most four tasks running.
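Combining notes 4 and 5, multiple expressions can be joined with && to constrain both a specific class and the overall task count; the class names here are hypothetical:

START_CLASS[Class_C] = (Class_A < 1) && (allclasses < 3)

A Class_C job step starts on a node only when no Class_A tasks are running there and fewer than three tasks of any class are running there.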
START_DAEMONS
Specifies whether to start the LoadLeveler daemons on the node.
Syntax: START_DAEMONS = true | false
Default value: true
When true, the daemons are started. In most cases, you will probably want to set this keyword to true. One reason to set it to false would be if you want to run the daemons on most of the machines in the cluster, but some individual users with their own local configuration files do not want their machines to run the daemons. Those users would set this keyword to false in their local configuration files; because the global configuration file has the keyword set to true, their individual machines would still be able to participate in the LoadLeveler cluster.
Also, to define the machine as strictly a submit-only machine, set this keyword to false.

STARTD
Location of the startd executable (LoadL_startd).
Syntax: STARTD = directory
Default value: $(BIN)/LoadL_startd
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

STARTD_COREDUMP_DIR
Local directory for storing LoadL_startd core dump files.
Syntax: STARTD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

STARTD_DGRAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: STARTD_DGRAM_PORT = port number
Default value: 9615.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

STARTD_RUNS_HERE
Specifies whether the startd daemon runs on the host. If you do not want to run the startd daemon, specify false.
Syntax: STARTD_RUNS_HERE = true | false
Default value: true

STARTD_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: STARTD_STREAM_PORT = port number
Default value: 9611.
For more information related to using this keyword, see “Defining network characteristics” on page 47.
STARTER
Location of the starter executable (LoadL_starter).
Syntax: STARTER = directory
Default value: $(BIN)/LoadL_starter
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

STARTER_COREDUMP_DIR
Local directory for storing LoadL_starter core dump files.
Syntax: STARTER_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

SUBMIT_FILTER
Specifies the program you want to run to filter a job script when the job is submitted.
Syntax: SUBMIT_FILTER = full_path_to_executable
Where full_path_to_executable is called with the job command file as the standard input. The standard output is submitted to LoadLeveler. If the program returns with a nonzero exit code, the job submission is canceled. A submit filter can make changes only to LoadLeveler job command file keyword statements.
Default value: No default value is set.
Multicluster use: In a multicluster environment, if you specified a valid cluster list with either the llsubmit -X option or the ll_cluster API, then the SUBMIT_FILTER will instead be invoked with a modified job command file that contains a cluster_list keyword generated from either the llsubmit -X option or the ll_cluster API. The modified job command file will contain an inserted # @ cluster_list = cluster statement just prior to the first # @ queue statement. This cluster_list statement takes precedence and overrides all previous specifications of any cluster_list statements from the original job command file.
Example: SUBMIT_FILTER in a multicluster environment
The following job command file, job.cmd, requests to be run remotely on cluster1:
#!/bin/sh
# @ cluster_list = cluster1
# @ error = job1.$(Host).$(Cluster).$(Process).err
# @ output = job1.$(Host).$(Cluster).$(Process).out
# @ queue
After issuing llsubmit -X cluster2 job.cmd, the modified job command file statements will be run on cluster2:
#!/bin/sh
# @ cluster_list = cluster1
# @ error = job1.$(Host).$(Cluster).$(Process).err
# @ output = job1.$(Host).$(Cluster).$(Process).out
# @ cluster_list = cluster2
# @ queue
For more information related to using this keyword, see “Filtering a job script” on page 76.
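The contract described above, receiving the job command file on standard input, writing the file to be submitted on standard output, and returning a nonzero exit code to cancel the submission, can be met by a small shell script. The sketch below is illustrative only; the installation path and the rejected class name are hypothetical:

SUBMIT_FILTER = /usr/local/sbin/llsubmit_filter

#!/bin/sh
# Hypothetical submit filter (/usr/local/sbin/llsubmit_filter):
# copy the job command file through unchanged, but cancel the
# submission if any statement requests the class "restricted".
while IFS= read -r line; do
  case "$line" in
    '# @ class = restricted'*)
      echo "submit filter: class restricted is not allowed" >&2
      exit 1    # a nonzero exit code cancels the submission
      ;;
  esac
  printf '%s\n' "$line"    # standard output is what LoadLeveler submits
done
exit 0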
SUSPEND
Determines whether running jobs should be suspended.
Syntax: SUSPEND: expression that evaluates to T or F (true or false)
When T, LoadLeveler temporarily suspends jobs currently running on the machine. Suspended LoadLeveler jobs will either be continued or vacated. This keyword is not supported for parallel jobs.
Default value: No default value is set.
For information about time-related variables that you may use for this keyword, see “Variables to use for setting times” on page 320.

SYSPRIO
System priority expression.
Syntax: SYSPRIO : expression
You can use the following LoadLeveler variables to define the SYSPRIO expression:
v ClassSysprio
v GroupQueuedJobs
v GroupRunningJobs
v GroupSysprio
v GroupTotalJobs
v GroupTotalShares
v GroupUsedBgShares
v GroupUsedShares
v JobIsBlueGene
v QDate
v UserHoldTime
v UserPrio
v UserQueuedJobs
v UserRunningJobs
v UserSysprio
v UserTotalJobs
v UserTotalShares
v UserUsedBgShares
v UserUsedShares
For detailed descriptions of these variables, see “LoadLeveler variables” on page 314.
Default value: 0 (zero)
Note:
1. The SYSPRIO keyword is valid only on the machine where the central manager is running. Using this keyword in a local configuration file has no effect.
2. It is recommended that you do not use UserPrio in the SYSPRIO expression, since user jobs are already ordered by UserPrio.
3. The string SYSPRIO can be used as both the name of an expression (SYSPRIO: value) and the name of a variable (SYSPRIO = value). To specify the expression to be used to calculate job priority, you must use the syntax for the SYSPRIO expression. If the variable is mistakenly used for the SYSPRIO expression, which requires a colon (:) after the name, the job priority value will always be 0 because the SYSPRIO expression has not been defined.
4. When the UserRunningJobs, GroupRunningJobs, UserQueuedJobs, GroupQueuedJobs, UserTotalJobs, GroupTotalJobs, GroupTotalShares, GroupUsedShares, UserTotalShares, UserUsedShares, GroupUsedBgShares, JobIsBlueGene, and UserUsedBgShares variables are used to prioritize the queue based on current usage, you should also set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL so that the priorities are adjusted according to current usage rather than usage only at submission time.
Examples:
v Example 1
This example creates a FIFO job queue based on submission time:
SYSPRIO : 0 - (QDate)
v Example 2
This example accounts for Class, User, and Group system priorities:
SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
v Example 3
This example orders the queue based on the number of jobs a user is currently running. The user who has the fewest jobs running is first in the queue. You should set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with this SYSPRIO expression.
SYSPRIO : 0 - UserRunningJobs
v Example 4
This example shows one possible way to set up the SYSPRIO expression for fair share scheduling. For those jobs whose owner has no unused shares ($(UserHasShares) = 0), job priority depends only on QDate, making it a simple FIFO queue as in Example 1. For those jobs whose owner has unused shares ($(UserHasShares) = 1), job priority depends not only on QDate, but also on a uniform boost of 31 536 000 (the equivalent of the job being submitted one year earlier). These jobs still have priority differences because of submit time differences. It is like forming two priority tiers: the higher priority tier for jobs with unused shares and the lower priority tier for jobs without unused shares.
SYSPRIO: 31536000 * $(UserHasShares) - QDate
v Example 5
This example divides the jobs into three priority tiers:
– Those jobs whose owner and group both have unused shares are at the top tier
– Those jobs whose owner or group has unused shares are at the middle tier
– Those jobs whose owner and group both have no shares remaining are at the bottom tier
A user can submit two jobs to two different groups, the first job to a group with shares remaining and the second job to a group without any unused shares. If the user has unused shares, the first job will belong to the top tier and the second job will belong to the middle tier. If the user has no shares remaining, the first job will belong to the middle tier and the second job will