Computing On Demand (COD) extends Condor's high throughput computing abilities to include a method for running short-term jobs on instantly-available resources. Support for COD was added to Condor in version 6.5.2.
COD is motivated by the need to extend Condor's job management to include interactive, compute-intensive jobs, giving these jobs immediate access to the compute power they need over a relatively short period of time. COD provides computing power on demand, switching predefined resources from working on Condor jobs to working on the COD jobs. These COD jobs (applications) cannot use the batch scheduling functionality of Condor, since the COD jobs require interactive response time. Many of the applications that are well-suited to Condor's COD capabilities involve a cycle: the application blocks on user input, performs a burst of computation to compute results, blocks again on user input, performs another burst of computation, and so on. When the resources are not being used for the bursts of computation to service the application, they should continue to execute long-running batch jobs.
Here are examples of applications that may benefit from COD capability:
[...] recalculate button, the predefined Condor resources (nodes) work on the computation and send the results back to the master application providing the user interface and displaying the data. Ideally, while the user is entering new data or modifying formulas, these nodes work on non-COD jobs.
The way Condor helps these kinds of applications is to provide an infrastructure to use Condor batch resources for the types of compute nodes described above. Condor does NOT provide tools to parallelize existing GUI applications. The COD functionality is an interface to allow these compute nodes to interact with long-running Condor batch jobs. The user provides both the compute node applications and the interactive master application that controls them. Condor only provides a mechanism to allow these interactive (and often parallelized) applications to seamlessly interact with the Condor batch system.
The resources of a Condor pool (nodes) run jobs. When a high-priority COD job appears at a node, the lower-priority (currently running) batch job is suspended. The COD job runs immediately, while the batch job remains suspended. When the COD job completes, the batch job instantly resumes execution.
Administratively, an interactive COD application puts claims on nodes. While the COD application does not need the nodes (to run the COD jobs), the claims are suspended, allowing batch jobs to run.
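The hand-off between a batch job and a COD claim described above can be pictured with a toy model. This is only an illustrative sketch: the class and method names are hypothetical, and it does not reflect how the Condor daemons are implemented.

```python
class Node:
    """Toy model of one node shared by a batch job and a COD claim.

    Hypothetical illustration only; not Condor's implementation.
    """

    def __init__(self):
        self.batch = "running"    # the long-running batch job
        self.cod = "suspended"    # the COD claim is held but idle

    def resume_cod(self):
        """The COD application needs the node: the batch job yields."""
        self.batch = "suspended"
        self.cod = "running"

    def suspend_cod(self):
        """The COD burst is over: the batch job resumes execution."""
        self.cod = "suspended"
        self.batch = "running"
```

At any moment exactly one of the two is running, which mirrors the rule that the batch job remains suspended for as long as the COD job runs.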
Claims on nodes are assigned to users. A user with a claim on a resource can then suspend and resume a COD job at will. This gives the user a great deal of power on the claimed resource, even if it is owned by another user. Because of this, it is essential that users allowed to claim COD resources can be trusted not to abuse this power. Users must be explicitly authorized to create and use a COD claim on a machine. This privilege is granted when the Condor administrator places a given username in the VALID_COD_USERS list in the Condor configuration for the machine (usually in a local configuration file).
In addition, the tools to request and manage COD claims require that the user issuing the commands be authenticated. Use one of the strong authentication methods described in section 3.7.3 "Security Configuration". If one of these methods cannot be used, then file system authentication may be used when directly logging in to that machine (to be claimed) and issuing the command locally.
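The authorization rule above (a user may create COD claims only if listed in VALID_COD_USERS) can be sketched as a simple membership test. This is a minimal sketch under stated assumptions: the list is assumed to be separated by commas and/or whitespace, the function names are hypothetical, and the usernames in the usage note are made up. It is not the check performed by the condor_startd daemon.

```python
import re


def parse_user_list(value):
    """Split a username list on commas and/or whitespace (assumed format)."""
    return {name for name in re.split(r"[,\s]+", value.strip()) if name}


def may_claim(user, valid_cod_users):
    """Return True only if `user` is explicitly listed.

    An empty list authorizes no one.
    """
    return user in parse_user_list(valid_cod_users)
```

For example, with a list of "alice, bob", may_claim("alice", ...) is true and may_claim("carol", ...) is false.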
To run an application on a claimed COD resource, an authorized user defines characteristics of the application. Examples of characteristics are the executable or script to use, the directory to run the application in, command-line arguments, and files to use for standard input and output. COD users specify a ClassAd that describes these characteristics for their application. There are two ways for a user to define a COD application's ClassAd: in the Condor configuration files of the COD resource, or dynamically, at the time the application is spawned, using the condor_cod tool.
These two methods for defining the ClassAd can be used together. For example, the user can define some attributes in the configuration file, and only provide a few dynamically defined attributes with the condor_cod tool.
Regardless of how the COD application's ClassAd is defined, the application's executable and input data must be pre-staged at the node. This is a current limitation of Condor's support for COD that will eventually go away. For now, there is no mechanism to transfer files for a COD application, and all I/O must be performed locally or onto a network file system that is accessible by a node.
The following three sections detail defining the attributes. The first lists the attributes that can be used to define a COD application. The second describes how to define these attributes in a Condor configuration file. The third explains how to define these attributes using the condor_cod tool.
Attributes for a COD application are either required or optional. The following attributes are required:
[...] quotation marks ("). There is no default.
[...] quotation marks ("). There is no default.
[...] quotation marks (").
The following attributes are optional:
[...] quotation marks (").
[...] quotation marks (").
[...] NAME=value. Multiple variables are delimited with a semicolon. An example: Env = "PATH=/usr/local/bin:/usr/bin;TERM=vt100" It is a string attribute, and must therefore be enclosed in quotation marks (").
[...] quotation marks (").
[...] quotation marks ("). Multiple file names may be delimited with either commas or whitespace characters, and therefore, file names cannot contain spaces.
[...] (KillSig = "SIGQUIT"), or as an integer (KillSig = 3). The default is to use SIGTERM.
[...] quotation marks (").
[...] TRUE. The default if not specified is FALSE.
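The Env format described above (NAME=value pairs delimited with semicolons) can be illustrated with a small parser and formatter. This is a sketch, not Condor's own parsing code: the function names are hypothetical, and it assumes that names and values never contain a semicolon, since the text describes no escaping rule.

```python
def parse_env(env):
    """Parse 'NAME=value;NAME=value' into a dict.

    Each entry is split on its first '=' only, so values may contain '='.
    """
    result = {}
    for entry in env.split(";"):
        if not entry:
            continue  # tolerate an empty entry, e.g. a trailing semicolon
        name, sep, value = entry.partition("=")
        if not sep or not name:
            raise ValueError(f"not a NAME=value pair: {entry!r}")
        result[name] = value
    return result


def format_env(variables):
    """Build an Env string from a dict, rejecting embedded semicolons."""
    for name, value in variables.items():
        if ";" in name or ";" in value:
            raise ValueError(f"semicolon not allowed in {name!r}")
    return ";".join(f"{name}={value}" for name, value in variables.items())
```

Applied to the example string above, parse_env yields two variables, PATH and TERM, and format_env turns that dict back into the same string.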
NOTE: If any path attribute (Cmd, In, Out, Err, StarterUserLog) is not a full path name, Condor automatically prepends the value of IWD.
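The path rule in the note above can be sketched as follows. This is an illustration under assumptions, not Condor's code: the function name is hypothetical, the working directory is looked up under the key "Iwd" (the spelling used in the configuration example in this section), and Unix-style paths are assumed.

```python
import posixpath

# Path attributes named in the note above.
PATH_ATTRIBUTES = ("Cmd", "In", "Out", "Err", "StarterUserLog")


def resolve_paths(ad):
    """Return a copy of `ad` with relative path attributes joined onto Iwd."""
    resolved = dict(ad)
    iwd = ad.get("Iwd")
    for name in PATH_ATTRIBUTES:
        value = ad.get(name)
        if value is None or posixpath.isabs(value):
            continue  # attribute not set, or already a full path name
        if iwd is None:
            raise ValueError(f"{name} is relative but Iwd is not defined")
        resolved[name] = posixpath.join(iwd, value)
    return resolved
```

With Iwd set to "/tmp/cod-fractgen", an Out value of "output" resolves to "/tmp/cod-fractgen/output", while a Cmd that is already a full path is left unchanged.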
The final set of attributes define an identification for a COD application. The job ID is made up of both the ClusterId and ProcId attributes (as described below). This job ID is similar to the job ID that is created whenever a regular Condor batch job is submitted. When using COD, the job ID is only used to identify the job in various log messages and in the COD-specific output of condor_status.
The COD job ID is part of the information included in all events written to the StarterUserLog regarding a given job. The COD job ID is also used in the Condor debugging logs described in section 3.3.3, for example, in the condor_starter daemon's log file for COD jobs (called StarterLog.cod by default) or in the condor_startd daemon's log file (called StartLog by default).
These COD IDs are optional. Defining a job ID is useful where it helps users with accounting or debugging of their own applications.
To define COD attributes in the Condor configuration file for a given application, the user selects a keyword to uniquely name ClassAd attributes of the application. This case-insensitive keyword is used as a prefix for the various configuration file attribute names. When a user wishes to spawn a given application, the keyword is given as an argument to the condor_cod tool and the keyword is used at the remote COD resource to find attributes which define the application.
Any of the ClassAd attributes described in the previous section can be specified in the configuration file with the keyword prefix followed by an underscore character ("_").
For example, if the user's keyword for a given fractal generation application is "FractGen", the resulting entries in the Condor configuration file may appear as:
FractGen_Cmd = "/usr/local/bin/fractgen"
FractGen_Iwd = "/tmp/cod-fractgen"
FractGen_Out = "/tmp/cod-fractgen/output"
FractGen_Err = "/tmp/cod-fractgen/error"
FractGen_Args = "mandelbrot -0.65865,-0.56254 -0.45865,-0.71254"
In this example, the executable may create other files. The Out and Err attributes specified in the configuration file are only for standard output and standard error redirection.
When the user wishes to spawn an instance of this application, they give FractGen as the argument to the -keyword option on the command line of the condor_cod_activate command.
NOTE: If a user is defining all attributes of their COD application in the Condor configuration files, and the condor_startd daemon on the COD resource they are using is running as root, the user must also define Owner to be the user that the COD application should run as (see section 4.3.3 above).
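The keyword-prefix convention described in this section can be sketched with two small helpers: one that writes prefixed configuration entries for string attributes, and one that collects the attributes belonging to a keyword, matching the keyword without regard to case. Both function names are hypothetical, and the lookup is only a model of the convention, not the code the COD resource actually runs.

```python
def config_entries(keyword, attributes):
    """Render string attributes as '<keyword>_<Name> = "<value>"' lines."""
    return [f'{keyword}_{name} = "{value}"' for name, value in attributes.items()]


def attributes_for_keyword(config, keyword):
    """Collect the attributes whose names start with '<keyword>_'.

    The keyword is matched case-insensitively; the prefix is stripped
    from the returned attribute names.
    """
    prefix = keyword.lower() + "_"
    return {
        name[len(prefix):]: value
        for name, value in config.items()
        if name.lower().startswith(prefix)
    }
```

For instance, looking up "fractgen" in a configuration that holds FractGen_Cmd and FractGen_Iwd returns the attributes Cmd and Iwd, and ignores entries that belong to other keywords.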
COD users may define attributes dynamically (at the time they spawn a COD application). In this case, the user writes the ClassAd attributes into a file, and the file name is passed to the condor_cod_activate tool using the -jobad command-line option. These attributes are read by the condor_cod tool and passed through the system onto the condor_starter daemon which spawns the COD application. If the file name given is -, the condor_cod tool will read from standard input (stdin).
Users should not add a keyword prefix when defining attributes with the condor_cod_activate tool. The attribute names can be used in the file directly.
WARNING: The current syntax for this file is not the same as the syntax in the file used with condor_submit.
NOTE: Users should not define the Owner attribute when using condor_cod_activate on the command line, since Condor will automatically insert the correct value based on what user runs the condor_cod_activate command and how that user authenticates to the COD resource. If a user defines an Owner value that does not match the authenticated identity, Condor treats this case as an error, and it will fail to launch the application.
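A file of dynamically defined attributes could be generated along the following lines. This is a hedged sketch: it assumes the file holds one "Name = value" line per attribute, in the same form as the attribute examples earlier in this section (strings in quotation marks, integers bare, booleans as TRUE or FALSE), and the function names are hypothetical. It refuses to write Owner, following the note above.

```python
def format_value(value):
    """Format a Python value in the style of the attribute examples."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        if '"' in value:
            # The text gives no escaping rule, so embedded quotes are refused.
            raise ValueError("embedded quotation marks are not handled")
        return f'"{value}"'
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def render_job_ad(attributes):
    """Render attributes as 'Name = value' lines, with no keyword prefix."""
    if any(name.lower() == "owner" for name in attributes):
        raise ValueError("do not define Owner; Condor inserts it")
    return "".join(
        f"{name} = {format_value(value)}\n" for name, value in attributes.items()
    )
```

The resulting text can be saved to a file whose name is then given with the -jobad option, or written to standard input when the file name is -.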
Separate commands are provided by Condor to manage COD claims on batch resources. Once created, each COD claim has a unique identifying string, called the claim ID. Most commands require a claim ID to specify which claim you wish to act on. These commands are the means by which COD applications interact with the rest of the Condor system. They should be issued by the controller application to manage its compute nodes. Here is a list of the commands:
To issue these commands, a user or application invokes the condor_cod tool. A command may be specified as the first argument to this tool, as
condor_cod -request -name c02.cs.wisc.edu
or the condor_cod tool can be installed in such a way that the same binary is used for a set of names, as
condor_cod_request -name c02.cs.wisc.edu
In addition, there is now a -cod option to condor_status.
The following sections describe each option in greater detail.
A user must be granted authorization to create COD claims on a specific machine. In addition, when the user uses these COD claims, the application they wish to run must be pre-staged on the machine. Therefore, a user cannot simply request a COD claim at random.
The user specifies the resource on which to make a COD claim. This is accomplished by specifying the name of the condor_startd daemon desired by invoking condor_cod_request with the -name option and the host name. For example:
condor_cod_request -name c02.cs.wisc.edu
If the condor_startd daemon desired belongs to a different Condor pool than the one where the COD commands are executed, use the -pool option to provide the name of the central manager machine of the other pool. For example:
condor_cod_request -name c02.cs.wisc.edu -pool condor.cs.wisc.edu
Alternatively, provide the IP address and port number where the condor_startd daemon is listening with the -addr option. This information can be found in the condor_startd ClassAd as the attribute StartdIpAddr or by reading the log file when the condor_startd first starts up. For example:
condor_cod_request -addr "<128.105.146.102:40967>"
If neither -name nor -addr is specified, condor_cod_request attempts to connect to the condor_startd daemon running on the local machine (where the request command was issued).
If the condor_startd daemon to be used for the COD claim is running on an SMP machine with multiple virtual machines, specify which resource on the machine to use for COD by providing a -requirements option. For example:
condor_cod_request -requirements 'VirtualMachineId==3'
or
condor_cod_request -requirements 'State!="Claimed"'
In general, be careful with shell quoting issues, so that your shell is not confused by the ClassAd expression syntax (in particular if the expression includes a string). The safest method is to enclose any requirement expression you provide within single quote marks (as shown above).
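When a controller application issues the request itself, one way to sidestep shell quoting is to build the command as an argument list instead of a single string. The sketch below does that using only the options shown in this section (-name, -pool, -addr, -requirements). The helper names are hypothetical, and the code only constructs the command; it does not run it.

```python
import shlex


def build_request_argv(name=None, pool=None, addr=None, requirements=None):
    """Build an argument list for condor_cod_request.

    Each value stays a single list element, so a ClassAd expression
    containing quotes or '!' needs no shell escaping.
    """
    argv = ["condor_cod_request"]
    if name is not None:
        argv += ["-name", name]
    if pool is not None:
        argv += ["-pool", pool]
    if addr is not None:
        argv += ["-addr", addr]
    if requirements is not None:
        argv += ["-requirements", requirements]
    return argv


def shell_line(argv):
    """Render the argument list as a line that is safe to paste into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

An argument list of this kind can be passed directly to Python's subprocess.run, which starts the program without involving a shell. The shell_line helper is for display or logging; for the requirements expression State!="Claimed" it produces the single-quoted form recommended above.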
Once a given condor_startd daemon has been contacted to request a new COD claim, the condor_startd daemon checks for proper authorization of the user issuing the command. If the user has the authority, and the condor_startd daemon finds a resource that matches any given requirements, the condor_startd daemon creates a new COD claim and gives it a unique identifier, the claim ID. This ID is used to identify COD claims when using other commands. If condor_cod_request succeeds, the claim ID for the new claim is printed out to the screen. All other commands to manage this claim require the claim ID to be provided as a command-line option.
When the condor_startd daemon assigns a COD claim, the ClassAd describing the resource is returned to the user that requested the claim. This ClassAd is a snapshot of the output of condor_status -long for the given machine. If condor_cod_request is invoked with the -classad option (which takes a file name as an argument), this ClassAd will be written out to the given file. Otherwise, the ClassAd is printed to the screen. The only essential piece of information in this ClassAd is the claim ID, so that is printed to the screen, even if the whole ClassAd is also being written to a file.
NOTE: Once a COD claim is created, there is no persistent record of it kept by the condor_startd daemon. So, if the condor_startd daemon is restarted for any reason, all existing COD claims will be destroyed and the new condor_startd daemon will not recognize any attempts to use the previous claims.
Condor's support for COD has a few limitations.
The following items are all limitations we plan to remove in future releases of Condor:
None of the above items are fundamentally difficult to add and we hope to address them relatively quickly. If you run into one of these limitations and it is a barrier to you using COD, please contact condor-admin@cs.wisc.edu with the subject "COD limitation" to gain quick help.
The following are more fundamental limitations that we do not plan to address: