Next: 3.7 Security In Condor Up: 3. Administrators' Manual Previous: 3.5 User Priorities Contents Index

Subsections

3.6 Startd Policy Configuration

This section describes the configuration of the condor_ startd to implement the desired policy for when remote jobs should start, be suspended, (possibly) resumed, vacate (with a checkpoint) or be killed (no checkpoint). This policy is the heart of Condor's balancing act between the needs and wishes of resource owners (machine owners) and resource users (people submitting their jobs to Condor). Please read this section carefully if you plan to change any of the settings described here, as a wrong setting can have a severe impact on either the owners of machines in your pool (they may ask to be removed from the pool entirely) or the users of your pool (they may stop using Condor).

Before we get into the details, there are a few things to note:

Much of this section refers to ClassAd expressions. You probably want to read through section 4.1 on ClassAd expressions before continuing with this.
If you are primarily familiar with the version 6.0 policy expressions and what they do, read section 3.6.10 on page . This section explains the differences between the version 6.0 policy expressions and later versions.
If you are defining the policy for an SMP machine (a multi-CPU machine), also read section 3.10.6 for specific information on configuring the condor_ startd for SMP machines. Each virtual machine represented by the condor_ startd on an SMP machine has its own state and activity (as described below). In the future, each virtual machine will be able to have its own individual policy expressions defined. Within this manual section, the word ``machine'' refers to an individual virtual machine within an SMP machine.

To define your policy, set expressions in the configuration file (see section 3.3 on Configuring Condor for an introduction to Condor's configuration files). The expressions are evaluated in the context of the machine's ClassAd and a job ClassAd. The expressions can therefore reference attributes from either ClassAd. Listed in this section are both the attributes that are included in the machine's ClassAd and the attributes that are included in a job ClassAd. The START expression is explained. It describes the conditions that must be met for a machine to start a job. The RANK expression is described. It allows the specification of the kinds of jobs a machine prefers to run. A final discussion details how the condor_ startd daemon works. Included are the machine states and activities, to give an idea of what is possible in policy decisions. Two example policy settings are presented.

3.6.1 Startd ClassAd Attributes

The condor_ startd daemon represents the machine on which it is running to the Condor pool. The daemon publishes characteristics about the machine in the machine's ClassAd to aid matchmaking with resource requests. The values of these attributes may be listed by using the command: condor_ status -l hostname. On an SMP machine, the condor_ startd will break the machine up and advertise it as separate virtual machines, each with its own name and ClassAd. The attributes themselves and what they represent are described below:

Activity

: String which describes Condor job activity on the machine. Can have one of the following values:

"Idle": : There is no job activity
"Busy": : A job is busy running
"Suspended": : A job is currently suspended
"Vacating": : A job is currently checkpointing
"Killing": : A job is currently being killed
"Benchmarking": : The startd is running benchmarks

Arch

: String with the architecture of the machine. Typically one of the following:

"INTEL": : Intel x86 CPU (Pentium, Xeon, etc).
"ALPHA": : Digital Alpha CPU
"SGI": : Silicon Graphics MIPS CPU
"SUN4u": : Sun UltraSparc CPU
"SUN4x": : A Sun Sparc CPU other than an UltraSparc, i.e. sun4m or sun4c CPU found in older Sparc workstations such as the Sparc 10, Sparc 20, IPC, IPX, etc.
"HPPA1": : Hewlett Packard PA-RISC 1.x CPU (i.e. PA-RISC 7000 series CPU) based workstation
"HPPA2": : Hewlett Packard PA-RISC 2.x CPU (i.e. PA-RISC 8000 series CPU) based workstation

ClockDay

: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.

ClockMin

: The number of minutes passed since midnight.

CondorLoadAvg

: The portion of the load average generated by Condor (either from remote jobs or running benchmarks).

ConsoleIdle

: The number of seconds since activity on the system console keyboard or console mouse has last been detected.

Cpus

: Number of CPUs in this machine, i.e. 1 = single CPU machine, 2 = dual CPUs, etc.

CurrentRank

: A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is -1.0.

Disk

: The amount of disk space on this machine available for the job in kbytes ( e.g. 23000 = 23 megabytes ). Specifically, this is the amount of disk space available in the directory specified in the Condor configuration files by the EXECUTE macro, minus any space reserved with the RESERVED_DISK macro.

EnteredCurrentActivity

: Time at which the machine entered the current Activity (see Activity entry above). On all platforms (including NT), this is measured in the number of seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).

FileSystemDomain

: A ``domain'' name configured by the Condor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems usually via NFS or AFS. This is useful for Vanilla universe jobs which require remote file access.

KeyboardIdle

: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected. Unlike ConsoleIdle, KeyboardIdle also takes activity on pseudo-terminals into account (i.e. virtual ``keyboard'' activity from telnet and rlogin sessions as well). Note that KeyboardIdle will always be equal to or less than ConsoleIdle.

KFlops

: Relative floating point performance as determined via a Linpack benchmark.

LastHeardFrom

: Time when the Condor central manager last received a status update from this machine. Expressed as seconds since the epoch (integer value). Note: This attribute is only inserted by the central manager once it receives the ClassAd. It is not present in the condor_ startd copy of the ClassAd. Therefore, you could not use this attribute in defining condor_ startd expressions (and you would not want to).

LoadAvg

: A floating point number with the machine's current load average.

Machine

: A string with the machine's fully qualified hostname.

Memory

: The amount of RAM in megabytes.

Mips

: Relative integer performance as determined via a Dhrystone benchmark.

MyType

: The ClassAd type; always set to the literal string "Machine".

Name

: The name of this resource; typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the condor_ startd will divide the CPUs up into separate virtual machines, each with with a unique name. These names will be of the form ``vm#@full.hostname'', for example, ``vm1@vulture.cs.wisc.edu'', which signifies virtual machine 1 from vulture.cs.wisc.edu.

OpSys

: String describing the operating system running on this machine. For Condor Version 6.6.0 typically one of the following:

"HPUX10": : for HPUX 10.20
"IRIX6": : for IRIX 6.2, 6.3, or 6.4
"IRIX65": : for IRIX 6.5
"LINUX": : for LINUX 2.0.x or LINUX 2.2.x kernel systems
"OSF1": : for Digital Unix 4.x
"SOLARIS251"
"SOLARIS26"
"SOLARIS27"
"SOLARIS28"
"WINNT40": : for Windows NT 4.0

Requirements

: A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.

StartdIpAddr

: String with the IP and port address of the condor_ startd daemon which is publishing this machine ClassAd.

State

: String which publishes the machine's Condor state. Can be:

"Owner": : The machine owner is using the machine, and it is unavailable to Condor.
"Unclaimed": : The machine is available to run Condor jobs, but a good match is either not available or not yet found.
"Matched": : The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
"Claimed": : The machine is claimed by a remote condor_ schedd and is probably running a job.
"Preempting": : A Condor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.

TargetType

: Describes what type of ClassAd to match with. Always set to the string literal "Job", because machine ClassAds always want to be matched with jobs, and vice-versa.

UidDomain

: a domain name configured by the Condor administrator which describes a cluster of machines which all have the same passwd file entries, and therefore all have the same logins.

VirtualMachineID

: For SMP machines, the integer that identifies the VM. The value will be X for the VM with

name="vmX@full.hostname"

For non-SMP machines with one virtual machine, the value will be 1.

VirtualMemory

: The amount of currently available virtual memory (swap space) expressed in kbytes.

In addition, there are a few attributes that are automatically inserted into the machine ClassAd whenever a resource is in the Claimed state:

ClientMachine: : The hostname of the machine that has claimed this resource
CurrentRank: : The value of the RANK expression when evaluated against the ClassAd of the ``current'' job using this machine. If the resource has been claimed but no job is running, the ``current'' job ClassAd is the one that was used when claiming the resource. If a job is currently running, that job's ClassAd is the ``current'' one. If the resource is between jobs, the ClassAd of the last job that was run is used for CurrentRank.
RemoteOwner: : The name of the user who originally claimed this resource.
RemoteUser: : The name of the user who is currently using this resource. In general, this will always be the same as the RemoteOwner, but in some cases, a resource can be claimed by one entity that hands off the resource to another entity which uses it. In that case, RemoteUser would hold the name of the entity currently using the resource, while RemoteOwner would hold the name of the entity that claimed the resource.

There are a few attributes that are only inserted into the machine ClassAd if a job is currently executing. If the resource is claimed but no job are running, none of these attributes will be defined.

JobId

: The job's identifier (for example,

152.3

), like you would see in condor_ q on the submitting machine.

JobStart

: The timestamp of when the job began executing.

LastPeriodicCheckpoint

: If the job has performed a periodic checkpoint, this attribute will be defined and will hold the timestamp of when the last periodic checkpoint was begun. If the job has yet to perform a periodic checkpoint, or cannot checkpoint at all, the LastPeriodicCheckpoint attribute will not be defined.

Finally, the single attribute, CurrentTime, is defined by the ClassAd environment.

CurrentTime: : Evaluates to the the number of seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).

3.6.2 Job ClassAd Attributes

CkptArch

: String describing the architecture of the machine where this job last checkpointed. If the job has never checkpointed, this attribute is UNDEFINED.

CkptOpSys

: String describing the operating system of the machine where this job last checkpointed. If the job has never checkpointed, this attribute is UNDEFINED.

ClusterId

: Integer cluster identifier for this job. A ``cluster'' is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier.

Cmd

: The path to and the file name of the job to be executed.

CompletionDate

: The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

CumulativeSuspensionTime

: A running total of the number of seconds the job has spent in suspension for the life of the job.

ExecutableSize

: Size of the executable in kbytes.

ExitBySignal

: An attribute that is True When a user job exits via a signal and false otherwise. It is available for use in all universes except the globus universe.

ExitCode

: When a user job exits by means other than a signal, this is the exit return code of the user job. It is available for use in all universes except the globus universe.

ExitSignal

: When a user job exits by means of an unhandled signal, this attribute takes on the numeric value of the signal. It is available for use in all universes except the globus universe.

ExitStatus

: The way that Condor previously dealt with a job's exit status. This attribute should no longer be used. It is not always accurate in heterogeneous pools, or if the job exited with a signal. Instead, see the attributes: ExitBySignal, ExitCode, and ExitSignal.

ImageSize

: Estimate of the memory image size of the job in kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image).

JobPrio

: Integer priority for this job, set by condor_ submit or condor_ prio. The default value is 0. The higher the number, the worse the priority.

JobStartDate

: Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

JobStatus

: Integer which indicates the current status of the job, where 1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, and 5 = Held.

JobUniverse

: Integer which indicates the job universe, where 1 = Standard, 4 = PVM, 5 = Vanilla, 7 = Scheduler, 8 = MPI, 9 = Globus, and 10 = Java.

LastCkptServer

: Hostname of the last checkpoint server used by this job. When a pool is using multiple checkpoint servers, this tells the job where to find its checkpoint file.

LastCkptTime

: Time at which the job last performed a successful checkpoint. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

LastSuspensionTime

: Time at which the job last performed a successful suspension. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

LastVacateTime

: Time at which the job was last evicted from a remote workstation. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

NumCkpts

: A count of the number of checkpoints written by this job during its lifetime.

NumRestarts

: A count of the number of restarts from a checkpoint attempted by this job during its lifetime.

NiceUser

: Boolean value which indicates whether this is a nice-user job.

Owner

: String describing the user who submitted this job.

ProcId

: Integer process identifier for this job. In a cluster of many jobs, each job will have the same ClusterId but will have a unique ProcId.

RemoteIwd

: The path to the directory in which a job is to be executed on a remote machine.

RemoteSysCpu

: The total number of seconds of system CPU time (the time spent at system calls) the job used on remote machines.

RemoteUserCpu

: The total number of seconds of user CPU time the job used on remote machines.

RemoteWallClockTime

: Cumulative number of seconds the job has been allocated a machine. This also includes time spent in suspension (if any), so the total real time spent running is

RemoteWallClockTime - CumulativeSuspensionTime

Note that this number does not get reset to zero when a job is forced to migrate from one machine to another.

TotalSuspensions

: A count of the number of times this job has been suspended during its lifetime.

QDate

: Time at which the job was submitted to the job queue. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

NumJobMatches

: An integer that is incremented by the condor_ schedd each time the job is matched with a resource ad by the negotiator.

NumGlobusSubmits

: An integer that is incremented each time the condor_ gridmanager receives confirmation of a successful job submission into Globus.

NumSystemHolds

: An integer that is incremented each time Condor-G places a job on hold due to some sort of error condition. This counter is useful, since Condor-G will always place a job on hold when it gives up on some error condition. Note that if the user places the job on hold using the condor_ hold command, this attribute is not incremented.

HoldReason

: A string containing a human-readable message about why a job is on hold. This is the message that will be displayed in response to the command condor\_q -hold. It can be used to determine if a job should be released or not.

ReleaseReason

: A string containing a human-readable message about why the job was released from hold.

EnteredCurrentStatus

: An integer containing the epoch time of when the job entered into its current status So for example, if the job is on hold, the ClassAd expression

    CurrentTime - EnteredCurrentStatus

will equal the number of seconds that the job has been on hold.

LastMatchTime

: An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.

LastRejMatchTime

: An integer containing the epoch time when Condor-G last tried to find a match for the job, but failed to do so.

LastRejMatchReason

: If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.

StreamOut

: An attribute utilized only for globus universe jobs. The default value is True. If True, and TransferOut is True, then job output is streamed back to the submit machine, instead of doing the transfer (as a whole) after the job completes. If False, then job output is transferred back to the submit machine (as a whole) after the job completes. If TransferOut is False, then this job attribute is ignored.

StreamErr

: An attribute utilized only for globus universe jobs. The default value is True. If True, and TransferErr is True, then standard error is streamed back to the submit machine, instead of doing the transfer (as a whole) after the job completes. If False, then standard error is transfered back to the submit machine (as a whole) after the job completes. If TransferErr is False, then this job attribute is ignored.

TransferOut

: An attribute utilized only for globus universe jobs. The default value is True. If True, then the output from the job is transferred from the remote machine back to the submit machine. The name of the file after transfer is the file referred to by job attribute Out. If False, no transfer takes place (remote to submit machine), and the name of the file is the file referred to by job attribute Out.

TransferIn

: An attribute utilized only for globus universe jobs. The default value is True. If True, then the job input is transferred from the submit machine to the remote machine. The name of the file that is transferred is given by the job attribute In. If False, then the job's input is taken from a file on the remote machine (pre-staged), and the name of the file is given by the job attribute In.

TransferErr

: An attribute utilized only for globus universe jobs. The default value is True. If True, then the error output from the job is transferred from the remote machine back to the submit machine. The name of the file after transfer is the file referred to by job attribute Err. If False, no transfer takes place (remote to submit machine), and the name of the file is the file referred to by job attribute Err.

TransferExecutable

: An attribute utilized only for globus universe jobs. The default value is True. If True, then the job executable is transferred from the submit machine to the remote machine. The name of the file (on the submit machine) that is transferred is given by the job attribute Cmd. If False, no transfer takes place, and the name of the file used (on the remote machine) will be as given in the job attribute Cmd.

3.6.3 The `START` expression

The most important expression to the condor_ startd is the START expression. This expression describes the conditions that must be met for a machine to run a job. This expression can reference attributes in the machine's ClassAd (such as KeyboardIdle and LoadAvg) or attributes in a job ClassAd (such as Owner, Imagesize, and Cmd, the name of the executable the job will run). The value after START expression evaluation plays a crucial role in determining the state and activity of a machine.

The Requirements expression is used for matching machines with jobs. The condor_ startd defines the Requirements expression by using the START expression. In situations where a machine wants to make itself unavailable for further matches, the Requirements expression is set to FALSE. When the START expression locally evaluates to TRUE, the machine advertises the Requirements expression as TRUE and does not publish the START expression.

Normally, the expressions in the machine ClassAd are evaluated against certain request ClassAds in the condor_ negotiator to see if there is a match, or against whatever request ClassAd currently has claimed the machine. However, by locally evaluating an expression, the machine only evaluates the expression against its own ClassAd. If an expression cannot be locally evaluated (because it references other expressions that are only found in a request ad, such as Owner or Imagesize), the expression is (usually) undefined. See section 4.1 for specifics on how undefined terms are handled in ClassAd expression evaluation.

NOTE: If you have machines with lots of real memory and swap space such that the only scarce resource is CPU time, consider using defining JOB_RENICE_INCREMENT so that Condor starts jobs on the machine with low priority. Then, further configure to set up the machines with:

        START = True
        SUSPEND = False
        PREEMPT = False
        KILL = False

In this way, Condor jobs always run and can never be kicked off. However, because they would run with ``nice priority'', interactive response on the machines will not suffer. You probably would not notice Condor was running the jobs, assuming you had enough free memory for the Condor jobs that there was little swapping.

3.6.4 The `RANK` expression

A machine may be configured to prefer certain jobs over others using the RANK expression. It is an expression, like any other in a machine ClassAd. It can reference any attribute found in either the machine ClassAd or a request ad (normally, in fact, it references things in the request ad). The most common use of this expression is likely to configure a machine to prefer to run jobs from the owner of that machine, or by extension, a group of machines to prefer jobs from the owners of those machines.

For example, imagine there is a small research group with 4 machines called tenorsax, piano, bass, and drums. These machines are owned by the 4 users coltrane, tyner, garrison, and jones, respectively.

Assume that there is a large Condor pool in your department, but you spent a lot of money on really fast machines for your group. You want to implement a policy that gives priority on your machines to anyone in your group. To achieve this, set the RANK expression on your machines to reference the Owner attribute and prefer requests where that attribute matches one of the people in your group as in

        RANK = Owner == "coltrane" || Owner == "tyner" \
               || Owner == "garrison" || Owner == "jones"

The RANK expression is evaluated as a floating point number. However, like in C, boolean expressions evaluate to either 1 or 0 depending on if they are TRUE or FALSE. So, if this expression evaluated to 1 (because the remote job was owned by one of the preferred users), it would be a larger value than any other user (for whom the expression would evaluate to 0).

A more complex RANK expression has the same basic set up, where anyone from your group has priority on your machines. Its difference is that the machine owner has better priority on their own machine. To set this up for Jimmy Garrison, place the following entry in Jimmy Garrison's local configuration file bass.local:

        RANK = (Owner == "coltrane") + (Owner == "tyner") \
               + ((Owner == "garrison") * 10) + (Owner == "jones")

NOTE: The parentheses in this expression are important, because ``+'' operator has higher default precedence than ``==''.

The use of ``+'' instead of ``| | '' allows us to distinguish which terms matched and which ones didn't. If anyone not in the John Coltrane quartet was running a job on the machine called bass, the RANK would evaluate numerically to 0, since none of the boolean terms evaluates to 1, and 0+0+0+0 still equals 0.

Suppose Elvin Jones submits a job. His job would match this machine (assuming the START was True for him at that time) and the RANK would numerically evaluate to 1. Therefore, Elvin would preempt the Condor job currently running. Assume that later Jimmy submits a job. The RANK evaluates to 10, since the boolean that matches Jimmy gets multiplied by 10. Jimmy would preempt Elvin, and Jimmy's job would run on Jimmy's machine.

The RANK expression is not required to reference the Owner of the jobs. Perhaps there is one machine with an enormous amount of memory, and others with not much at all. You can configure your large-memory machine to prefer to run jobs with larger memory requirements:

        RANK = ImageSize

That's all there is to it. The bigger the job, the more this machine wants to run it. It is an altruistic preference, always servicing the largest of jobs, no matter who submitted them. A little less altruistic is John's RANK that prefers his jobs over those with the largest Imagesize:

        RANK = (Owner == "coltrane" * 1000000000000) + Imagesize

This RANK breaks if a job is submitted with an image size of more 10¹² Kbytes. However, with that size, this RANK expression preferring that job would not be Condors only problem!

3.6.5 Machine States

A machine is assigned a state by Condor. The state depends on whether or not the machine is available to run Condor jobs, and if so, what point in the negotiations has been reached. The possible states are

Owner

The machine is being used by the machine owner, and/or is not available to run Condor jobs. When the machine first starts up, it begins in this state.

Unclaimed

The machine is available to run Condor jobs, but it is not currently doing so.

Matched

The machine is available to run jobs, and it has been matched by the negotiator with a specific schedd. That schedd just has not yet claimed this machine. In this state, the machine is unavailable for further matches.

Claimed

The machine has been claimed by a schedd.

Preempting

The machine was claimed by a schedd, but is now preempting that claim for one of the following reasons.

the owner of the machine came back
another user with higher priority has jobs waiting to run
another request that this resource would rather serve was found

Figure 3.3 shows the states and the possible transitions between the states.

**Figure 3.3:** Machine States
$\includegraphics{admin-man/machine-states.eps}$

3.6.6 Machine Activities

Within some machine states, activities of the machine are defined. The state has meaning regardless of activity. Differences between activities are significant. Therefore, a ``state/activity'' pair describes a machine. The following list describes all the possible state/activity pairs.

Owner

Idle

This is the only activity for Owner state. As far as Condor is concerned the machine is Idle, since it is not doing anything for Condor.
Unclaimed

Idle

This is the normal activity of Unclaimed machines. The machine is still Idle in that the machine owner is willing to let Condor jobs run, but Condor is not using the machine for anything.

Benchmarking

The machine is running benchmarks to determine the speed on this machine. This activity only occurs in the Unclaimed state. How often the activity occurs is determined by the RunBenchmarks expression.
Matched

Idle

When Matched, the machine is still Idle to Condor.
Claimed

Idle

In this activity, the machine has been claimed, but the schedd that claimed it has yet to activate the claim by requesting a condor_ starter to be spawned to service a job.

Busy

Once a condor_ starter has been started and the claim is active, the machine moves to the Busy activity to signify that it is doing something as far as Condor is concerned.

Suspended

If the job is suspended by Condor, the machine goes into the Suspended activity. The match between the schedd and machine has not been broken (the claim is still valid), but the job is not making any progress and Condor is no longer generating a load on the machine.
Preempting The preempting state is used for evicting a Condor job from a given machine. When the machine enters the Preempting state, it checks the WANT_VACATE expression to determine its activity.

Vacating

In the Vacating activity, the job that was running is in the process of checkpointing. As soon as the checkpoint process completes, the machine moves into either the Owner state or the Claimed state, depending on the reason for its preemption.

Killing

Killing means that the machine has requested the running job to exit the machine immediately, without checkpointing.

Figure 3.4 on page gives the overall view of all machine states and activities and shows the possible transitions from one to another within the Condor system. Each transition is labeled with a number on the diagram, and transition numbers referred to in this manual will be bold.

**Figure 3.4:** Machine States and Activities
$\includegraphics{admin-man/machine-activities.eps}$

Various expressions are used to determine when and if many of these state and activity transitions occur. Other transitions are initiated by parts of the Condor protocol (such as when the condor_ negotiator matches a machine with a schedd). The following section describes the conditions that lead to the various state and activity transitions.

3.6.7 State and Activity Transitions

This section traces through all possible state and activity transitions within a machine and describes the conditions under which each one occurs. Whenever a transition occurs, Condor records when the machine entered its new activity and/or new state. These times are often used to write expressions that determine when further transitions occurred. For example, enter the Killing activity if a machine has been in the Vacating activity longer than a specified amount of time.

3.6.7.1 Owner State

When the startd is first spawned, the machine it represents enters the Owner state. The machine remains in the Owner state while the expression IsOwner is TRUE. If the IsOwner expression is FALSE, then the machine transitions to the Unclaimed state. The default value for the IsOwner expression is

START =?= FALSE

So, the machine will remain in the Owner state as long as the START expression locally evaluates to FALSE. If the START locally evaluates to TRUE or cannot be locally evaluated (it evaluates to UNDEFINED), transition 1 occurs and the machine enters the Unclaimed state.

The Owner state represents a resource that is in use by its interactive owner (for example, if the keyboard is being used). The Unclaimed state represents a resource that is neither in use by its interactive user, nor the Condor system. From Condor's point of view, there is little difference between the Owner and Unclaimed states. In both cases, the resource is not currently in use by the Condor system. However, if a job matches the resource's START expression, the resource is available to run a job, regardless of if it is in the Owner or Unclaimed state. The only differences between the two states are how the resource shows up in condor_ status and other reporting tools, and the fact that Condor will not run benchmarking on a resource in the Owner state. As long as the IsOwner expression is TRUE, the machine is in the Owner State. When the IsOwner expression is FALSE, the machine goes into the Unclaimed State.

Here is an example that assumes that an IsOwner expression is not present in the configuration. If the START expression is

START = KeyboardIdle > 15 * $(MINUTE) && Owner == "coltrane"

and if KeyboardIdle is 34 seconds, then the machine would remain in the Owner state. Owner is undefined, and anything && FALSE is FALSE.

If, however, the START expression is

        START = KeyboardIdle > 15 * $(MINUTE) || Owner == "coltrane"

and KeyboardIdle is 34 seconds, then the machine leaves the Owner state and becomes Unclaimed. This is because FALSE || UNDEFINED is UNDEFINED. So, while this machine is not available to just anybody, if user coltrane has jobs submitted, the machine is willing to run them. Any other user's jobs have to wait until KeyboardIdle exceeds 15 minutes. However, since coltrane might claim this resource, but has not yet, the machine goes to the Unclaimed state.

While in the Owner state, the startd polls the status of the machine every UPDATE_INTERVAL to see if anything has changed that would lead it to a different state. This minimizes the impact on the Owner while the Owner is using the machine. Frequently waking up, computing load averages, checking the access times on files, computing free swap space take time, and there is nothing time critical that the startd needs to be sure to notice as soon as it happens. If the START expression evaluates to TRUE and five minutes pass before the startd notices, that's a drop in the bucket of high-throughput computing.

The machine can only transition to the Unclaimed state from the Owner state. It only does so when the START expression no longer locally evaluates to FALSE. In general, if the START expression locally evaluates to FALSE at any time, the machine will either transition directly to the Owner state or to the Preempting state on its way to the Owner state, if there is a job running that needs preempting.

3.6.7.2 Unclaimed State

If the IsOwner expression becomes TRUE, then the machine returns to the Owner state. If the IsOwner expression becomes FALSE, then the machine remains in the Unclaimed state. If the IsOwner expression is not present in the configuration files, then the default value for the IsOwner expression is

START =?= FALSE

so that while in the Unclaimed state, if the START expression locally evaluates to FALSE, the machine returns to the Owner state by transition 2.

When in the Unclaimed state, the RunBenchmarks expression is relevant. If RunBenchmarks evaluates to TRUE while the machine is in the Unclaimed state, then the machine will transition from the Idle activity to the Benchmarking activity (transition 3) and perform benchmarks to determine MIPS and KFLOPS. When the benchmarks complete, the machine returns to the Idle activity (transition 4).

The startd automatically inserts an attribute, LastBenchmark, whenever it runs benchmarks, so commonly RunBenchmarks is defined in terms of this attribute, for example:

        BenchmarkTimer = (CurrentTime - LastBenchmark)
        RunBenchmarks = $(BenchmarkTimer) >= (4 * $(HOUR))

Here, a macro, BenchmarkTimer is defined to help write the expression. This macro holds the time since the last benchmark, so when this time exceeds 4 hours, we run the benchmarks again. The startd keeps a weighted average of these benchmarking results to try to get the most accurate numbers possible. This is why it is desirable for the startd to run them more than once in its lifetime.

NOTE: LastBenchmark is initialized to 0 before benchmarks have ever been run. So, if you want the startd to run benchmarks as soon as the machine is Unclaimed (if it hasn't done so already), include a term for LastBenchmark as in the example above.

NOTE: If RunBenchmarks is defined and set to something other than FALSE, the startd will automatically run one set of benchmarks when it first starts up. To disable benchmarks, both at startup and at any time thereafter, set RunBenchmarks to FALSE or comment it out of the configuration file.

From the Unclaimed state, the machine can go to three other possible states: Owner (transition 2), Matched, or Claimed/Idle. Once the condor_ negotiator matches an Unclaimed machine with a requester at a given schedd, the negotiator sends a command to both parties, notifying them of the match. If the schedd receives that notification and initiates the claiming procedure with the machine before the negotiator's message gets to the machine, the Match state is skipped, and the machine goes directly to the Claimed/Idle state (transition 5). However, normally the machine will enter the Matched state (transition 6), even if it is only for a brief period of time.

3.6.7.3 Matched State

The Matched state is not very interesting to Condor. Noteworthy in this state is that the machine lies about its START expression while in this state and says that Requirements are false to prevent being matched again before it has been claimed. Also interesting is that the startd starts a timer to make sure it does not stay in the Matched state too long. The timer is set with the MATCH_TIMEOUT configuration file macro. It is specified in seconds and defaults to 300 (5 minutes). If the schedd that was matched with this machine does not claim it within this period of time, the machine gives up, and goes back into the Owner state via transition 7. It will probably leave the Owner state right away for the Unclaimed state again and wait for another match.

At any time while the machine is in the Matched state, if the START expression locally evaluates to FALSE, the machine enters the Owner state directly (transition 7).

If the schedd that was matched with the machine claims it before the MATCH_TIMEOUT expires, the machine goes into the Claimed/Idle state (transition 8).

3.6.7.4 Claimed State

The Claimed state is certainly the most complex state. It has the most possible activities and the most expressions that determine its next activities. In addition, the condor_ checkpoint and condor_ vacate commands affect the machine when it is in the Claimed state. In general, there are two sets of expressions that might take effect. They depend on the universe of the request: standard or vanilla. The standard universe expressions are the normal expressions. For example:

        WANT_SUSPEND            = True
        WANT_VACATE             = $(ActivationTimer) > 10 * $(MINUTE)
        SUSPEND                 = $(KeyboardBusy) || $(CPUBusy)
        ...

The vanilla expressions have the string``_VANILLA'' appended to their names. For example:

        WANT_SUSPEND_VANILLA    = True
        WANT_VACATE_VANILLA     = True
        SUSPEND_VANILLA         = $(KeyboardBusy) || $(CPUBusy)
        ...

Without specific vanilla versions, the normal versions will be used for all jobs, including vanilla jobs. In this manual, the normal expressions are referenced. The difference exists for the the resource owner that might want the machine to behave differently for vanilla jobs, since they cannot checkpoint. For example, owners may want vanilla jobs to remain suspended for longer than standard jobs.

While Claimed, the POLLING_INTERVAL takes effect, and the startd polls the machine much more frequently to evaluate its state.

If the machine owner starts typing on the console again, it is best to notice this as soon as possible to be able to start doing whatever the machine owner wants at that point. For SMP machines, if any virtual machine is in the Claimed state, the startd polls the machine frequently. If already polling one virtual machine, it does not cost much to evaluate the state of all the virtual machines at the same time.

In general, when the startd is going to take a job off a machine (usually because of activity on the machine that signifies that the owner is using the machine again), the startd will go through successive levels of getting the job out of the way. The first and least costly to the job is suspending it. This works for both standard and vanilla jobs. If suspending the job for a short while does not satisfy the machine owner (the owner is still using the machine after a specific period of time), the startd moves on to vacating the job. Vacating a job involves performing a checkpoint so that the work already completed is not lost. If even that does not satisfy the machine owner (usually because it is taking too long and the owner wants their machine back now), the final, most drastic stage is reached: killing. Killing is a quick death to the job, without a checkpoint. For vanilla jobs, vacating and killing are equivalent, although a vanilla job can request to have a specific softkill signal sent to it at vacate time so that the job itself can perform application-specific checkpointing.

The WANT_SUSPEND expression determines if the machine will evaluate the SUSPEND expression to consider entering the Suspended activity. The WANT_VACATE expression determines what happens when the machine enters the Preempting state. It will go to the Vacating activity or directly to Killing. If one or both of these expressions evaluates to FALSE, the machine will skip that stage of getting rid of the job and proceed directly to the more drastic stages.

When the machine first enters the Claimed state, it goes to the Idle activity. From there, it has two options. It can enter the Preempting state via transition 9 (if a condor_ vacate arrives, or if the START expression locally evaluates to FALSE), or it can enter the Busy activity (transition 10) if the schedd that has claimed the machine decides to activate the claim and start a job.

From Claimed/Busy, the machine can transition to three other state/activity pairs. The startd evaluates the WANT_SUSPEND expression to decide which other expressions to evaluate. If WANT_SUSPEND is TRUE, then the startd evaluates the SUSPEND expression. If SUSPEND is FALSE, then the startd will evaluate the PREEMPT expression and skip the Suspended activity entirely. By transition, the possible state/activity destinations from Claimed/Busy:

Claimed/Idle: If the starter that is serving a given job exits (for example because the jobs completes), the machine will go to Claimed/Idle (transition 11).
Preempting: If WANT_SUSPEND is FALSE and the PREEMPT expression is TRUE, the machine enters the Preempting state (transition 12). The other reason the machine would go from Claimed/Busy to Preempting is if the condor_ negotiator matched the machine with a ``better'' match. This better match could either be from the machine's perspective using the RANK Expression above, or it could be from the negotiator's perspective due to a job with a higher user priority. In this case, WANT_VACATE is assumed to be TRUE, and the machine transitions to Preempting/Vacating.
Claimed/Suspended: If both the WANT_SUSPEND and SUSPEND expressions evaluate to TRUE, the machine suspends the job (transition 13).

If a condor_ checkpoint command arrives, or the PeriodicCheckpoint expression evaluates to TRUE, there is no state change. The startd has no way of knowing when this process completes, so periodic checkpointing can not be another state. Periodic checkpointing remains in the Claimed/Busy state and appears as a running job.

From the Claimed/Suspended state, the following transitions may occur:

Claimed/Busy: If the CONTINUE expression evaluates to TRUE, the machine resumes the job and enters the Claimed/Busy state (transition 14).
Preempting: If the PREEMPT expression is TRUE, the machine will enter the Preempting state (transition 15).

3.6.7.5 Preempting State

The Preempting state is less complex than the Claimed state. There are two activities. Depending on the value of WANT_VACATE, a machine will be in the Vacating activity (if TRUE) or the Killing activity (if FALSE).

While in the Preempting state (regardless of activity) the machine advertises its Requirements expression as FALSE to signify that it is not available for further matches, either because it is about to transition to the Owner state, or because it has already been matched with one preempting match, and further preempting matches are disallowed until the machine has been claimed by the new match.

The main function of the Preempting state is to get rid of the starter associated with the resource. If the condor_ starter associated with a given claim exits while the machine is still in the Vacating activity, then the job successfully completed its checkpoint.

If the machine is in the Vacating activity, it keeps evaluating the KILL expression. As soon as this expression evaluates to TRUE, the machine enters the Killing activity (transition 16).

When the starter exits, or if there was no starter running when the machine enters the Preempting state (transition 9), the other purpose of the Preempting state is completed: notifying the schedd that had claimed this machine that the claim is broken.

At this point, the machine enters either the Owner state by transition 17 (if the job was preempted because the machine owner came back) or the Claimed/Idle state by transition 18 (if the job was preempted because a better match was found). The machine enters the Killing activity, and it starts a timer, the length of which is defined by the KILLING_TIMEOUT macro. This macro is defined in seconds and defaults to 30. If this timer expires and the machine is still in the Killing activity, something has gone seriously wrong with the condor_ starter and the startd tries to vacate the job immediately by sending SIGKILL to all of the condor_ starter's children, and then to the condor_ starter itself.

Once the starter is gone and the schedd that had claimed the machine is notified that the claim is broken, the machine will either enter the Owner state by transition 19 (if the job was preempted because the machine owner came back) or the Claimed/Idle state by transition 20 (if the job was preempted because a better match was found).

3.6.8 State/Activity Transition Expression Summary

This section is a summary of the information from the previous sections. It serves as a quick reference.

START: When TRUE, the machine is willing to spawn a remote Condor job.
RunBenchmarks: While in the Unclaimed state, the machine will run benchmarks whenever TRUE.
MATCH_TIMEOUT: If the machine has been in the Matched state longer than this value, it will transition to the Owner state.
WANT_SUSPEND: If TRUE, the machine evaluates the SUSPEND expression to see if it should transition to the Suspended activity. If FALSE, the machine look at the PREEMPT expression.
SUSPEND: If WANT_SUSPEND is TRUE, and the machine is in the Claimed/Busy state, it enters the Suspended activity if SUSPEND is TRUE.
CONTINUE: If the machine is in the Claimed/Suspended state, it enter the Busy activity if CONTINUE is TRUE.
PREEMPT: If the machine is either in the Claimed/Suspended activity, or is in the Claimed/Busy activity and WANT_SUSPEND is FALSE, the machine enters the Preempting state whenever PREEMPT is TRUE.
WANT_VACATE: This is checked only when the PREEMPT expression is TRUE and the machine enters the Preempting state. If WANT_VACATE is TRUE, the machine enters the Vacating activity. If it is FALSE, the machine will proceed directly to the Killing activity.
KILL: If the machine is the Preempting/Vacating state, it enters Preempting/Killing whenever KILL is TRUE.
KILLING_TIMEOUT: If the machine is in the Preempting/Killing state for longer than KILLING_TIMEOUT seconds, the startd sends a SIGKILL to the condor_ starter and all its children to try to kill the job as quickly as possible.
PERIODIC_CHECKPOINT: If the machine is in the Claimed/Busy state and PERIODIC_CHECKPOINT is TRUE, the user's job begins a periodic checkpoint.
RANK: If this expression evaluates to a higher number for a pending resource request than it does for the current request, the machine preempts the current request (enters the Preempting/Vacating state). When the preemption is complete, the machine enters the Claimed/Idle state with the new resource request claiming it.

3.6.9 Policy Settings

This section describes the default configuration policy and then provides examples of extensions to these policies.

3.6.9.1 Default Policy Settings

These settings are the default as shipped with Condor. They have been used for many years with no problems. The vanilla expressions are identical to the regular ones. (They are not listed here. If not defined, the standard expressions are used for vanilla jobs as well).

The following are macros to help write the expressions clearly.

StateTimer: Amount of time in the current state.
ActivityTimer: Amount of time in the current activity.
ActivationTimer: Amount of time the job has been running on this machine.
LastCkpt: Amount of time since the last periodic checkpoint.
NonCondorLoadAvg: The difference between the system load and the Condor load (the load generated by everything but Condor).
BackgroundLoad: Amount of background load permitted on the machine and still start a Condor job.
HighLoad: If the $(NonCondorLoadAvg) goes over this, the CPU is considered too busy, and eviction of the Condor job should start.
StartIdleTime: Amount of time the keyboard must to be idle before Condor will start a job.
ContinueIdleTime: Amount of time the keyboard must to be idle before resumption of a suspended job.
MaxSuspendTime: Amount of time a job may be suspended before more drastic measures are taken.
MaxVacateTime: Amount of time a job may be checkpointing before we give up kill it outright.
KeyboardBusy: A boolean expression that evaluates to TRUE when the keyboard is being used.
CPUIdle: A boolean expression that evaluates to TRUE when the CPU is idle.
CPUBusy: A boolean expression that evaluates to TRUE when the CPU is busy.
MachineBusy: The CPU or the Keyboard is busy.
CPUIsBusy: A boolean value set to the same value as CPUBusy .
CPUBusyTime: The value 0 if CPUBusy is False; the time in seconds since CPUBusy became True.

##  These macros are here to help write legible expressions:
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 0.3
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        = 5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 10 * $(MINUTE)

KeyboardBusy            = KeyboardIdle < $(MINUTE)
ConsoleBusy             = (ConsoleIdle  < $(MINUTE))
CPUIdle                = $(NonCondorLoadAvg) <= $(BackgroundLoad)
CPUBusy                = $(NonCondorLoadAvg) >= $(HighLoad)
KeyboardNotBusy         = ($(KeyboardBusy) == False)
MachineBusy             = ($(CPUBusy) || $(KeyboardBusy)

Macros are defined to want to suspend jobs (instead of killing them) in the case of jobs that use little memory, when the keyboard is not being used, and for vanilla universe and PVM universe jobs. We want to gracefully vacate jobs which have been running for more than 10 minutes or are vanilla universe or PVM universe jobs.

WANT_SUSPEND       = ( $(SmallJob) || $(KeyboardNotBusy) \
                       || $(IsPVM) || $(IsVanilla) )
WANT_VACATE        = ( $(ActivationTimer) > 10 * $(MINUTE) \
                       || $(IsPVM) || $(IsVanilla) )

Finally, definitions of the actual expressions. Start a job if the keyboard has been idle long enough and the load average is low enough OR the machine is currently running a Condor job. Note that Condor would only run one job at a time. It just may prefer to run a different job, as defined by the machine rank or user priorities.

START        = ( (KeyboardIdle > $(StartIdleTime)) \
                  && ( $(CPUIdle) || \
                       (State != "Unclaimed" && State != "Owner")) )

Suspend a job if the keyboard has been touched. Alternatively, suspend if the CPU has been busy for more than two minutes and the job has been running for more than 90 seconds.

SUSPEND         = ( $(KeyboardBusy) || \
                 ( (CpuBusyTime > 2 * $(MINUTE)) \
                    && $(ActivationTimer) > 90 ) )

Continue a suspended job if the CPU is idle, the Keyboard has been idle for long enough, and the job has been suspended more than 10 seconds.

CONTINUE        = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
                  && (KeyboardIdle > $(ContinueIdleTime)) )

There are two conditions that signal preemption. The first condition is if the job is suspended, but it has been suspended too long. The second condition is if suspension is not desired and the machine is busy.

PREEMPT	        = ( ((Activity == "Suspended") && \
                    ($(ActivityTimer) > $(MaxSuspendTime))) \
                    || (SUSPEND && (WANT_SUSPEND == False)) )

Kill jobs that take too long leaving gracefully.

KILL            = $(ActivityTimer) > $(MaxVacateTime)

Finally, specify periodic checkpointing. For jobs smaller than 60 Mbytes, do a periodic checkpoint every 6 hours. For larger jobs, only checkpoint every 12 hours.

PERIODIC_CHECKPOINT     = ( (ImageSize < 60000) && \
                            ($(LastCkpt) > (6 * $(HOUR))) ) || \ 
                          ( $(LastCkpt) > (12 * $(HOUR)) )

At UW-Madison, we have a fast network. We simplify our expression considerably to

PERIODIC_CHECKPOINT     = $(LastCkpt) > (3 * $(HOUR))

For reference, the entire set of policy settings are included once more without comments:

##  These macros are here to help write legible expressions:
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 0.3
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        = 5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 10 * $(MINUTE)

KeyboardBusy            = KeyboardIdle < $(MINUTE)
ConsoleBusy             = (ConsoleIdle  < $(MINUTE))
CPUIdle                = $(NonCondorLoadAvg) <= $(BackgroundLoad)
CPUBusy                = $(NonCondorLoadAvg) >= $(HighLoad)
KeyboardNotBusy         = ($(KeyboardBusy) == False)
MachineBusy             = ($(CPUBusy) || $(KeyboardBusy)

WANT_SUSPEND       = ( $(SmallJob) || $(KeyboardNotBusy) \
                       || $(IsPVM) || $(IsVanilla) )
WANT_VACATE        = ( $(ActivationTimer) > 10 * $(MINUTE) \
                       || $(IsPVM) || $(IsVanilla) )
START        = ( (KeyboardIdle > $(StartIdleTime)) \
                  && ( $(CPUIdle) || \
                       (State != "Unclaimed" && State != "Owner")) )
SUSPEND         = ( $(KeyboardBusy) || \
                 ( (CpuBusyTime > 2 * $(MINUTE)) \
                    && $(ActivationTimer) > 90 ) )
CONTINUE        = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
                  && (KeyboardIdle > $(ContinueIdleTime)) )
PREEMPT	        = ( ((Activity == "Suspended") && \
                    ($(ActivityTimer) > $(MaxSuspendTime))) \
                    || (SUSPEND && (WANT_SUSPEND == False)) )
KILL            = $(ActivityTimer) > $(MaxVacateTime)
PERIODIC_CHECKPOINT     = ( (ImageSize < 60000) && \
                            ($(LastCkpt) > (6 * $(HOUR))) ) || \ 
                          ( $(LastCkpt) > (12 * $(HOUR)) )

3.6.9.2 Policy Examples

This example shows how the default macros can be used to set up a machine for testing. Suppose we want the machine to behave normally, except if user coltrane submits a job. In that case, we want that job to start regardless of what is happening on the machine. We do not want the job suspended, vacated or killed. This is reasonable if we know coltrane is submitting very short running programs testing purposes. The jobs should be executed right away. This works with any machine (or the whole pool, for that matter) by adding the following 5 expressions to the existing configuration:

        START      = ($(START)) || Owner == "coltrane"
        SUSPEND    = ($(SUSPEND)) && Owner != "coltrane"
        CONTINUE   = $(CONTINUE)
        PREEMPT    = ($(PREEMPT)) && Owner != "coltrane"
        KILL       = $(KILL)

Notice that there is nothing special in either the CONTINUE or KILL expressions. If Coltrane's jobs never suspend, they never look at CONTINE. Similarly, if they never preempt, they never look at KILL.

Condor can be configured to only run jobs at certain times of the day. In general, we discourage configuring a system like this, since you can often get lots of good cycles out of machines, even when their owners say ``I'm always using my machine during the day.'' However, if you submit mostly vanilla jobs or other jobs that cannot checkpoint, it might be a good idea to only allow the jobs to run when you know the machines will be idle and when they will not be interrupted.

To configure this kind of policy, you should use the ClockMin and ClockDay attributes, defined in section 3.6.1 on ``Startd ClassAd Attributes''. These are special attributes which are automatically inserted by the condor_ startd into its ClassAd, so you can always reference them in your policy expressions. ClockMin defines the number of minutes that have passed since midnight. For example, 8:00am is 8 hours after midnight, or 8 * 60 minutes, or 480. 5:00pm is 17 hours after midnight, or 17 * 60, or 1020. ClockDay defines the day of the week, Sunday = 0, Monday = 1, and so on.

To make the policy expressions easy to read, we recommend using macros to define the time periods when you want jobs to run or not run. For example, assume regular ``work hours'' at your site are from 8:00am until 5:00pm, Monday through Friday:

WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
              (ClockDay > 0 && ClockDay < 6) ) 
AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
               (ClockDay == 0 || ClockDay == 6) )

Of course, you can fine-tune these settings by changing the definition of AfterHours and WorkHours for your site.

Assuming you are using the default policy expressions discussed above, there are only a few minor changes required to force Condor jobs to stay off of your machines during work hours:

# Only start jobs after hours.
START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)

# Consider the machine busy during work hours, or if the keyboard or
# CPU are busy.
MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )

By default, the MachineBusy macro is used to define the SUSPEND and PREEMPT expressions. If you have changed these expressions at your site, you will need to add $(WorkHours) to your SUSPEND and PREEMPT expressions as appropriate.

Depending on your site, you might also want to avoid suspending jobs during work hours, so that in the morning, if a job is running, it will be immediately preempted, instead of being suspended for some length of time:

WANT_SUSPEND = $(AfterHours)

3.6.10 Differences from the Version 6.0 Policy Settings

This section describes how the current policy expressions differ from the policy expressions in previous versions of Condor. If you have never used Condor version 6.0 or earlier, or you never looked closely at the policy settings, skip this section.

In summary, there is no longer a VACATE expression, and the KILL expression is not evaluated while a machine is claimed. There is a PREEMPT expression which describes the conditions when a machine will move from the Claimed state to the Preempting state. Once a machine is transitioning into the Preempting state, the WANT_VACATE expression controls whether the job should be vacated with a checkpoint or directly killed. The KILL expression determines the transition from Preempting/Vacating to Preempting/Killing.

In previous versions of Condor, the KILL expression handled three distinct cases (the transitions from Claimed/Busy, Claimed/Suspended and Preempting/Vacating), and the VACATE expression handled two cases (the transitions from Claimed/Busy and Claimed/Suspended). In the current version of Condor, PREEMPT handles the same two cases as the previous VACATE expression, but the KILL expression handles one case. Very complex policies can now be specified using all of the default expressions, only tuning the WANT_VACATE and WANT_SUSPEND expressions. In previous versions, heavy use of the WANT_* expressions caused a complex KILL expression.

Next: 3.7 Security In Condor Up: 3. Administrators' Manual Previous: 3.5 User Priorities Contents Index

condor-admin@cs.wisc.edu