A directed acyclic graph (DAG) can be used to represent a set of programs where the input, output, or execution of one or more programs is dependent on one or more other programs. The programs are nodes (vertices) in the graph, and the edges (arcs) identify the dependencies. Condor finds machines for the execution of programs, but it does not schedule programs (jobs) based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for Condor jobs. DAGMan submits jobs to Condor in an order represented by a DAG and processes the results. An input file defined prior to submission describes the DAG, and a Condor submit description file for each program in the DAG is used by Condor.
Each node (program) in the DAG needs its own Condor submit description file. As DAGMan submits jobs to Condor, it monitors the Condor log file(s) to enforce the ordering required for the DAG.
The DAG itself is defined by the contents of a DAGMan input file. DAGMan is responsible for scheduling, recovery, and reporting for the set of programs submitted to Condor.
The following sections specify the use of DAGMan.
The input file used by DAGMan specifies five items:
Comments may be placed in the input file that describes the DAG.
The pound character (#) as the first character on a line identifies the line as a comment. Comments do not span lines.
An example input file for DAGMan is
    # Filename: diamond.dag
    #
    Job  A  A.condor
    Job  B  B.condor
    Job  C  C.condor
    Job  D  D.condor
    Script PRE  A top_pre.csh
    Script PRE  B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE  C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE  D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry  C 3
This input file describes the DAG shown in Figure 2.2.
The first section of the input file lists all the programs that appear in the DAG. Each program is described by a single line called a Job Entry. The syntax used for each Job Entry is
JOB JobName SubmitDescriptionFileName [DONE]
A Job Entry maps a JobName to a Condor submit description file. The JobName uniquely identifies nodes within the DAGMan input file and within output messages.
The keyword JOB and the JobName are not case sensitive. A JobName of joba is equivalent to JobA. The SubmitDescriptionFileName is case sensitive, since the UNIX file system is case sensitive. The JobName can be any string that contains no white space.
The optional DONE identifies a job as being already completed. This is useful in situations where the user wishes to verify results, but does not need all programs within the dependency graph to be executed. The DONE feature is also utilized when an error occurs causing the DAG to not be completed. DAGMan generates a Rescue DAG, a DAGMan input file that can be used to restart and complete a DAG without re-executing completed programs.
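For example, a Job Entry marking node A as already completed (using the file name from the diamond example) would be:

```
JOB A A.condor DONE
```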
The second type of item in a DAGMan input file enumerates processing that is done either before a program within the DAG is submitted to Condor for execution or after a program within the DAG completes its execution. Processing done before a program is submitted to Condor is called a PRE script. Processing done after a program successfully completes its execution under Condor is called a POST script. A node in the DAG is comprised of the program together with PRE and/or POST scripts. The dependencies in the DAG are enforced based on nodes.
The syntax for PRE and POST script lines within the input file is
SCRIPT PRE JobName ExecutableName [arguments]
SCRIPT POST JobName ExecutableName [arguments]
The SCRIPT keyword identifies the type of line within the DAG input file. The PRE or POST keyword specifies the relative timing of when the script is to be run. The JobName specifies the node to which the script is attached. The ExecutableName specifies the script to be executed, and it may be followed by any command line arguments to that script. The ExecutableName and optional arguments have their case preserved.
Scripts are optional for each job, and any scripts are executed on the machine to which the DAG is submitted.
The PRE and POST scripts are commonly used when files must be placed into a staging area for the job to use, and files are cleaned up or removed once the job is finished running. An example using PRE/POST scripts involves staging files that are stored on tape. The PRE script reads compressed input files from the tape drive, and it uncompresses them, placing the input files in the current directory. The program within the DAG node is submitted to Condor, and it reads these input files. The program produces output files. The POST script compresses the output files, writes them out to the tape, and then deletes the staged input and output files.
DAGMan takes note of the exit value of the scripts as well as the program. If the PRE script fails (exit value != 0), then neither the program nor the POST script runs, and the node is marked as failed.
If the PRE script succeeds, the program is submitted to Condor. If the program fails and there is no POST script, the DAG node is marked as failed. An exit value not equal to 0 indicates program failure; it is therefore important that the program return an exit value of 0 when it did not fail.
If the program fails and there is a POST script, node failure is determined by the exit value of the POST script. A failing value from the POST script marks the node as failed. A succeeding value from the POST script (even with a failed program) marks the node as successful. Therefore, the POST script may need to consider the return value from the program.
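This decision logic can be sketched as a small POST script. The following is a hypothetical sketch, written as a shell function for illustration (the name post_check and the messages are assumptions, not part of the manual); DAGMan would pass the node name and the program's exit value as arguments.

```shell
# Hypothetical POST script logic. A nonzero return here marks the
# node as failed; returning 0 marks it successful, even if the
# program itself failed.
post_check() {
    job="$1"   # node name
    ret="$2"   # exit value of the node's program
    if [ "$ret" -ne 0 ]; then
        echo "node $job: program failed with exit value $ret"
        return "$ret"    # propagate the failure to DAGMan
    fi
    echo "node $job: program succeeded"
    return 0
}
```

A POST script that always returned 0 would instead mask program failures, marking every node successful.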
By default, the POST script is run regardless of the program's return value. To prevent POST scripts from running after failed jobs, pass the -NoPostFail argument to condor_submit_dag.
A node not marked as failed at any point is successful.
Two variables are available to ease script writing. The $JOB variable evaluates to JobName. For POST scripts, the $RETURN variable evaluates to the return value of the program. The variables may be placed anywhere within the arguments.
As an example, suppose the PRE script expands a compressed file named JobName.gz. The SCRIPT entries for jobs A, B, and C are

    SCRIPT PRE A pre.csh $JOB .gz
    SCRIPT PRE B pre.csh $JOB .gz
    SCRIPT PRE C pre.csh $JOB .gz
The script pre.csh may use these arguments
    #!/bin/csh
    gunzip $argv[1]$argv[2]
The third type of item in the DAG input file describes the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any child node may be started. A child node is started once all its parents have successfully completed.
The syntax of a dependency line within the DAG input file:
PARENT ParentJobName... CHILD ChildJobName...
The PARENT keyword is followed by one or more ParentJobNames. The CHILD keyword is followed by one or more ChildJobNames. Each child job depends on every parent job on the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. As an example, the line
    PARENT p1 p2 CHILD c1 c2

produces four dependencies:

    p1 to c1
    p1 to c2
    p2 to c1
    p2 to c2
The fourth type of item in the DAG input file provides a way (optional) to retry failed nodes. The syntax for retry is
RETRY JobName NumberOfRetries
where the JobName is the same as the name given in a Job Entry line, and NumberOfRetries is an integer, the number of times to retry the node after failure. The default number of retries for any node is 0, the same as not having a retry line in the file.
In the event of retry, all parts of a node within the DAG are redone, following the same rules regarding node failure as given above. The PRE script is executed first, followed by submission of the program to Condor upon success of the PRE script. Failure of the node is then determined by the return value of the program and by the existence and return value of a POST script.
The fifth type of item in the DAG input file provides a method of defining a macro to be placed into the submit description file. These macros are defined on a per-node basis, using the following format.
VARS JobName macroname="string"...
The definition of the macro is available to use within the submit description file, as the submission for the node's executable is of the form
    condor_submit SubmitDescriptionFileName -a "+macroname=\"string\""

Job submission with the "+" attribute not only adds the attribute to the ClassAd of the job, but also allows dereferencing of the attribute through the macro usage in the submit description file.
There may be more than one macro defined for each JobName. The space character delimits the list of macros.
Correct syntax requires that the string must be enclosed in double quotes. To use a double quote inside string, escape it with the backslash character (\). To add the backslash character itself, use two backslashes (\\).
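As a hypothetical illustration combining these rules (the macro names and values here are assumptions, chosen only to show the quoting):

```
VARS JobA infile="run1.dat" note="a \"quoted\" word" dir="subdir\\data"
```

Here note expands to a "quoted" word and dir expands to subdir\data.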
Each node in a DAG may be a unique executable, and each may have a unique Condor submit description file. Each program may be submitted to a different universe within Condor, for example standard, vanilla, or DAGMan.
One limitation exists: each Condor submit description file must submit only one job. There may not be multiple queue commands, or DAGMan will fail.
DAGMan no longer requires that all jobs specify the same log file. However, if the DAG contains a very large number of jobs, each specifying its own log file, performance may suffer. Therefore, if the DAG contains a large number of jobs, it is best to have all of the jobs use the same log file.

DAGMan enforces the dependencies within a DAG using the events recorded in the log file(s) produced by job submission to Condor.
Here is an example Condor submit description file to go with the diamond-shaped DAG example.
    # Filename: diamond_job.condor
    #
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
    notification = NEVER
    queue
This example uses the same Condor submit description file for all the jobs in the DAG. This implies that each node within the DAG runs the same program. The $(cluster) macro is used to produce unique file names for each program's output. Each job is submitted separately, into its own cluster, so this provides unique names for the output files.
The notification is set to NEVER in this example. This tells Condor not to send e-mail about the completion of a program submitted to Condor. For DAGs with many nodes, this is recommended to reduce or eliminate excessive numbers of e-mails.
A separate example shows an intended use of a Vars entry in the DAG. It can be used to dramatically reduce the number of submit description files needed for a DAG. In the case where the submit description file for each node varies only in file naming, the use of a substitution macro within the submit description file allows the use of a single submit description file. Note that the node output log file currently cannot be specified using a macro passed from the DAG.
The example uses a single submit description file in the DAG input file, and uses the Vars entry to name output files.
    # submit description file called: theonefile.sub
    executable = progX
    output     = $(outfilename)
    error      = error.$(outfilename)
    universe   = standard
    queue
The relevant portion of the DAG input file appears as
    JOB A theonefile.sub
    JOB B theonefile.sub
    JOB C theonefile.sub
    VARS A outfilename="A"
    VARS B outfilename="B"
    VARS C outfilename="C"
For a DAG like this one with thousands of nodes, being able to write and maintain a single submit description file and a single, yet more complex, DAG input file is preferable.
A DAG is submitted using the program condor_submit_dag. See the manual page for complete details.
A simple submission has the syntax
    condor_submit_dag DAGInputFileName
The example may be submitted with
    condor_submit_dag diamond.dag

In order to guarantee recoverability, the DAGMan program itself is run as a Condor job. As such, it needs a submit description file. condor_submit_dag produces the needed file, naming it by appending .condor.sub to the DAGInputFileName. This submit description file may be edited if the DAG is submitted with
    condor_submit_dag -no_submit diamond.dag

causing condor_submit_dag to generate the submit description file, but not submit DAGMan to Condor. To submit the DAG, once the submit description file is edited, use
condor_submit diamond.dag.condor.sub
An optional argument to condor_submit_dag, -maxjobs, is used to specify the maximum number of Condor jobs that DAGMan may submit to Condor at one time. It is commonly used when there is a limited amount of input file staging capacity. As a specific example, consider a case where each job will require 4 Mbytes of input files, and the jobs will run in a directory with a volume of 100 Mbytes of free space. Using the argument -maxjobs 25 guarantees that a maximum of 25 jobs, using a maximum of 100 Mbytes of space, will be submitted to Condor at one time.
While the -maxjobs argument is used to limit the number of Condor jobs submitted at one time, it may be desirable to limit the number of scripts running at one time. The optional -maxpre argument limits the number of PRE scripts that may be running at one time, while the optional -maxpost argument limits the number of POST scripts that may be running at one time.
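These limits may be combined on one command line. For example, using the -maxjobs value above together with hypothetical script limits (the values 5 are illustrative, not recommendations):

```
condor_submit_dag -maxjobs 25 -maxpre 5 -maxpost 5 diamond.dag
```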
After submission, the progress of the DAG can be monitored by looking at the log file(s), observing the e-mail that program submission to Condor causes, or by using condor_q -dag.
condor_submit_dag attempts to check the DAG input file. If a problem is detected, condor_submit_dag prints out an error message and aborts.
DAGMan normally generates a list of job log files to monitor by examining all of the job submission files. If that will not work (some job submission files will be generated by PRE scripts, for example), you can specify a single common log file with the -log option. An example of this submission:
    condor_submit_dag -log diamond_condor.log

This option tells condor_submit_dag to use the given file as the log file if no log files are specified in the submit files.
To remove an entire DAG, consisting of DAGMan plus any jobs submitted to Condor, remove the DAGMan job running under Condor. condor_q will list the job number. Use the job number to remove the job, for example
    % condor_q
    -- Submitter: turunmaa.cs.wisc.edu : <128.105.175.125:36165> : turunmaa.cs.wisc.edu
     ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
       9.0   smoler       10/12 11:47   0+00:01:32 R  0   8.7  condor_dagman -f -
      11.0   smoler       10/12 11:48   0+00:00:00 I  0   3.6  B.out
      12.0   smoler       10/12 11:48   0+00:00:00 I  0   3.6  C.out

    3 jobs; 2 idle, 1 running, 0 held

    % condor_rm 9.0
Before the DAGMan job stops running, it uses condor_rm to remove any Condor jobs within the DAG that are running.
In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in Condor's queue.
DAGMan can help with the resubmission of uncompleted portions of a DAG when one or more nodes resulted in failure. If any node in the DAG fails, the remainder of the DAG is continued until no more forward progress can be made based on the DAG's dependencies. At this point, DAGMan produces a file called a Rescue DAG.
The Rescue DAG is a DAG input file, functionally the same as the original DAG file. It additionally indicates successfully completed nodes using the DONE option. If the DAG is resubmitted using this Rescue DAG input file, the nodes marked as completed will not be re-executed.

The Rescue DAG is automatically generated by DAGMan when a node within the DAG fails. The file is named by appending the suffix .rescue to the DAGInputFileName. Statistics about the failed DAG execution are presented as comments at the beginning of the Rescue DAG input file.
If the Rescue DAG file is generated before all retries of a node are completed, then the Rescue DAG file will also contain Retry entries. The number of retries will be set to the appropriate remaining number of retries.
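As a sketch, assuming node C of the diamond example failed after one of its three retries was used, the Rescue DAG diamond.dag.rescue might contain (statistics comments and Script lines omitted here for brevity):

```
# Rescue DAG for diamond.dag (sketch)
Job A A.condor DONE
Job B B.condor DONE
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 2
```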
It can be helpful to see a picture of a DAG. DAGMan can assist you in visualizing a DAG by creating the input files used by the AT&T Research Labs graphviz package. dot is a program within this package, available from http://www.research.att.com/sw/tools/graphviz, and it is used to draw pictures of DAGs.
DAGMan produces one or more dot files as the result of an extra line in a DAGMan input file. The line appears as
DOT dag.dot
This creates a file called dag.dot, which contains a specification of the DAG before any programs within the DAG are submitted to Condor. The dag.dot file is used to create a visualization of the DAG by using this file as input to dot. This example creates a PostScript file with a visualization of the DAG:
dot -Tps dag.dot -o dag.ps
Within the DAGMan input file, the DOT command can take several optional parameters:
    DOT dag.dot DONT-OVERWRITE

causes files dag.dot.0, dag.dot.1, dag.dot.2, etc. to be created. This option is most useful combined with the UPDATE option to visualize the history of the DAG after it has finished executing.
A label= parameter may also be given. This may be useful if further editing of the created files would be necessary, perhaps because you are automatically visualizing the DAG as it progresses.
If conflicting parameters are used in a DOT command, the last one listed is used.