
7.3 Running Condor Jobs

I'm at the University of Wisconsin-Madison Computer Science Dept., and I am having problems!

Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, which by default has access control restrictions which can prevent Condor jobs from running properly. The above URL will explain how to solve the problem.

I'm getting a lot of e-mail from Condor. Can I just delete it all?

Generally you shouldn't ignore all of the mail Condor sends, but you can reduce the amount you get by telling Condor that you don't want to be notified every time a job successfully completes, only when a job experiences an error. To do this, include a line in your submit file like the following:

Notification = Error

See the Notification parameter in the condor_submit man page on page [*] of this manual for more information.

Why will my vanilla jobs only run on the machine from which I submitted them?

Check the following:

  1. Did you submit the job from a local filesystem that other computers can't access?

    See Section 3.3.5, on page [*].

  2. Did you set a special requirements expression for vanilla jobs that prevents them, but not your other jobs, from running? (See the example after this list for how to inspect a job's requirements expression.)

    See Section 3.3.5, on page [*].

  3. Is Condor running as a non-root user?

    See Section 3.7.2, on page [*].
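
One way to investigate items 1 and 2 is to look at the requirements expression Condor actually uses for the job. This is a sketch; the job number 42.0 is only a placeholder:

condor_q -l 42.0 | grep -i Requirements

If the expression contains a clause involving FileSystemDomain, Condor will only match the job with machines that share the submit machine's file system domain, which often explains why a vanilla job stays on the submit machine.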

My job starts but exits right away with signal 9.

This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
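
A quick way to check for a missing shared library (a sketch; my_program stands for your executable) is to run ldd on the program while logged in to one of the execute machines:

ldd my_program

Any library reported as ``not found'' must either be installed on that machine or linked into the program statically. With gcc, for example, adding -static to the link command is one way to produce a statically linked executable.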

Why aren't any or all of my jobs running?

Problems like the following are often reported to us:

I have submitted 100 jobs to my pool, and only 18 appear to be
running, but there are plenty of machines available.  What should I
do to investigate the reason why this happens?

Start by following these steps to understand the problem:

  1. Run condor_q -analyze and see what it says.

  2. Look at the User Log file (whatever you specified as "log = XXX" in the submit file).

    See if the jobs are starting to run but then exiting right away, or if they never even start.

  3. Look at the SchedLog on the submit machine after it negotiates for this user. If a user does not have enough priority to get more machines, the SchedLog will contain a message like "lost priority, no more jobs". (See the example after this list for how to locate the daemon log files.)

  4. If jobs are successfully being matched with machines, they still might be dying when they try to execute due to file permission problems or the like. Check the ShadowLog on the submit machine for warnings or errors.

  5. Look at the NegotiatorLog during the negotiation for the user. Look for messages about priority, "no more machines", or similar.
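
The locations of these daemon log files vary from site to site. A convenient way to find them is to ask Condor itself with condor_config_val, as sketched here (run the first two commands on the submit machine and the last one on the central manager):

condor_config_val SCHEDD_LOG
condor_config_val SHADOW_LOG
condor_config_val NEGOTIATOR_LOG

Each command prints the full path of the corresponding log file.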

Another problem shows itself through messages like the following in the log file produced by the condor_schedd daemon (the file given by $(SCHEDD_LOG)):

Swap Space Estimate Reached, 0 jobs matched, 13 jobs idle

Condor computes the total swap space on your submit machine. It then tries to limit the total number of jobs it will spawn, based on an estimate of the size of the condor_shadow daemon's memory footprint and a configurable amount of swap space that should be reserved. This is done to avoid the situation, within a very large pool, in which all the jobs are submitted from a single host: the huge number of condor_shadow processes would overwhelm the submit machine, which would run out of swap space and thrash.

Things can go wrong if a machine has a lot of physical memory but little or no swap space. Condor does not consider the physical memory size, so it can conclude that it has no swap space to work with, and it will not run the submitted jobs.

To see how much swap space Condor thinks a given machine has, use the output of a condor_status command of the following form:

condor_status -schedd [hostname] -long | grep VirtualMemory

If the value listed is 0, then this is what is confusing Condor. There are two ways to fix the problem:

  1. Configure your machine with some real swap space.

  2. Disable this check within Condor by defining the amount of reserved swap space for the submit machine to be 0. Set RESERVED_SWAP to 0 in the configuration file:

    RESERVED_SWAP = 0
    

    and then send a condor_restart to the submit machine.


Why does the requirements expression for the job I submitted have extra things that I did not put in my submit description file?

There are several clauses that Condor automatically appends to the requirements expression given in the submit description file. Typical additions include:

  1. Arch and OpSys clauses, so that the job only matches machines with the same architecture and operating system as the submit machine, unless you specify these yourself.

  2. A memory clause of the form ((Memory * 1024) >= ImageSize), so that the job only matches machines with enough memory for its image (see the ImageSize question below).

  3. A Disk clause, so that the job only matches machines with enough disk space for the job.

  4. For vanilla jobs that rely on a shared file system, a FileSystemDomain clause, so that the job only matches machines that share the submit machine's file system.

To see exactly what was added for a particular job, run condor_q -l on the job and compare its Requirements attribute with what you wrote in the submit description file.


Can I submit my standard universe SPARC Solaris 2.6 jobs and have them run on a SPARC Solaris 2.7 machine?

No. Binary compatibility in the standard universe is supported between SPARC Solaris 2.5.1 and SPARC Solaris 2.6, and between SPARC Solaris 2.7 and SPARC Solaris 2.8, but not between SPARC Solaris 2.6 and SPARC Solaris 2.7. We may implement support for this feature in a future release of Condor.

Can I submit my standard universe SPARC Solaris 2.8 jobs and have them run on a SPARC Solaris 2.9 machine?

No. Although normal executables are binary compatible, technical details of taking checkpoints currently prevent this particular combination. Note that this applies to standard universe jobs only.


Why do my vanilla jobs keep cycling between suspended and unsuspended?

Condor tries to provide a number, the ``Condor Load Average'' (reported in the machine ClassAd as CondorLoadAvg), which is intended to represent the total load average on the system caused by any running Condor job(s). Unfortunately, it is impossible to get an accurate number for this without support from the operating system, and that support does not exist. So Condor does the best it can, and the estimate works well in most cases. However, there are a number of ways this statistic can go wrong.

The old default Condor policy was to suspend the job if the non-Condor load average went over a certain threshold. However, because of the problems with providing accurate numbers for this (described below), some jobs would go into a cycle of getting suspended and resumed. The default suspend policy now shipped with Condor uses the solution explained below.

The technical details of why CondorLoadAvg can be wrong are too involved for a complete answer here, but a brief explanation follows. When a job has periodic behavior, the load it places on the machine changes over time, and so does the system load. Condor's estimate of the job's share of the system load (which it uses to compute CondorLoadAvg) changes as well. When the job stops doing work, both the system load and the Condor load begin to fall. If everything worked correctly, they would fall at exactly the same rate, and NonCondorLoad would stay constant. Unfortunately, CondorLoadAvg falls faster, because Condor thinks the job's share of the total load is falling too. As a result, CondorLoadAvg falls faster than the system load, NonCondorLoad appears to rise, and the old default SUSPEND expression becomes true.

It may seem that Condor should be able to avoid this problem, but it cannot: without help from the operating systems Condor runs on (help that does not exist), there is no good way to get this right. The only way to compute these numbers more accurately without operating system support would be to sample everything at such a high rate that Condor itself would create a large load average, just to compute the load average. This is Heisenberg's uncertainty principle in action.

A similar sampling error can occur when Condor is starting a job within the vanilla universe with many processes and with a heavy initial load. Condor mistakenly decides that the load on the machine has gotten too high while the job is in the initialization phase and kicks the job off the machine.

To correct this problem, Condor needs to check whether the load on the machine has been high over an interval of time. There is a machine attribute, CpuBusyTime, that can be used for this purpose. It gives the length of time that $(CpuBusy) (defined in the default configuration file) has been true, or 0 if $(CpuBusy) is false. $(CpuBusy) is usually defined in terms of the non-Condor load. These are the default settings:

NonCondorLoadAvg    = (LoadAvg - CondorLoadAvg)
HighLoad            = 0.5
CPUBusy             = ($(NonCondorLoadAvg) >= $(HighLoad))

To take advantage of CpuBusyTime, you can use it in your SUSPEND expression.

Here is an example:

SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)

The above policy says to suspend the job only if the CPU has been busy with non-Condor load for at least three minutes and it has been at least 90 seconds since the job started.

Why might my job be preempted (evicted)?

There are four circumstances under which Condor may evict a job. They are controlled by different expressions.

Reason number 1 is the user priority: controlled by the PREEMPTION_REQUIREMENTS expression in the configuration file. If there is a job from a higher priority user sitting idle, the condor_negotiator daemon may evict a currently running job submitted by a lower priority user if PREEMPTION_REQUIREMENTS is True. For more on user priorities, see section 2.7 and section 3.5.

Reason number 2 is the owner (machine) policy: controlled by the PREEMPT expression in the configuration file. When a job is running and the PREEMPT expression evaluates to True, the condor_startd will evict the job. The PREEMPT expression should reflect the conditions under which the machine owner will not permit a job to continue to run. For example, a policy to evict a currently running job when a key is hit or when it is the 9:00am work arrival time would be expressed in the PREEMPT expression and enforced by the condor_startd (a sketch of such an expression appears after reason number 4). For more on the PREEMPT expression, see section 3.6.

Reason number 3 is the owner (machine) preference: controlled by the RANK expression in the configuration file (sometimes called the startd rank or machine rank). The RANK expression is evaluated as a floating point number. When one job is running, a second idle job that evaluates to a higher RANK value tells the condor_startd to prefer the second job over the first. Therefore, the condor_startd will evict the first job so that it can start running the second (preferred) job. For more on RANK, see section 3.6.

Reason number 4 is a Condor shutdown: when Condor is to be shut down on a machine that is currently running a job, Condor evicts the currently running job before proceeding with the shutdown.
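
As an illustration of reason number 2, here is a sketch of a PREEMPT expression that implements the ``key is hit or 9:00am work arrival'' policy mentioned above. It uses the standard KeyboardIdle and ClockMin machine attributes; the thresholds are only examples:

# ClockMin is minutes since midnight: 9:00am = 540, 5:00pm = 1020
WorkHours = ( (ClockMin >= 540) && (ClockMin < 1020) )

# Evict the running job if a key was hit within the last minute,
# or if the machine owner's working hours have begun
PREEMPT = ( (KeyboardIdle < 60) || $(WorkHours) )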

What signals get sent to my jobs when Condor needs to preempt or kill them, or when I remove them from the queue? Can I tell Condor which signals to send?

The answer depends on the universe of the job.

Under the scheduler universe, the signal a job receives upon condor_rm can be set by the user in the submit description file with a command of the form

remove_kill_sig = SIGWHATEVER

If this command is not defined, Condor looks for a command in the submit description file of the form

kill_sig = SIGWHATEVER

If that command is also not given, Condor uses SIGTERM.

For all other universes, the jobs get the value of the submit description file command kill_sig, which is SIGTERM by default.

If a job is killed or evicted, the job is sent its kill_sig signal, unless it is on the receiving end of a hard kill, in which case it gets SIGKILL.

Under all universes, the signal is sent only to the parent PID of the job, namely, the first child of the condor_starter. If that process forks children of its own, it must catch and forward signals to them as appropriate for the desired behavior. The exception to this is (again) a hard kill, in which case Condor sends SIGKILL to all the PIDs in the process family.
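
For example, an application that traps a particular signal in order to shut down cleanly can ask Condor to send that signal instead of the default SIGTERM. This is a sketch of a submit description file fragment; the executable name and the choice of SIGUSR1 are only examples:

# send SIGUSR1 instead of SIGTERM when the job is evicted or removed
executable = my_server
kill_sig   = SIGUSR1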

Why does my Linux job have an enormous ImageSize and refuse to run anymore?

Sometimes Linux jobs run, are preempted, and cannot start again because Condor thinks the image size of the job is too big. This is because Condor has a problem calculating the image size of a program that uses threads on Linux. It is particularly noticeable in the Java universe, but it also happens in the vanilla universe. It is not an issue in the standard universe, because threaded programs are not allowed there.

On Linux, each thread appears to consume as much memory as the entire program consumes, so the image size appears to be (number-of-threads * image-size-of-program). If your program uses a lot of threads, your apparent image size balloons. You can see the image size that Condor believes your program has by using the -l option to condor_q, and looking at the ImageSize attribute.
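
For example, the following sketch prints the attribute for a single job (the job number 42.0 is only a placeholder):

condor_q -l 42.0 | grep ImageSize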

When you submit your job, Condor creates or extends the requirements expression for your job. In particular, it adds a requirement that your job must run on a machine with sufficient memory:

Requirements = ... ((Memory * 1024) >= ImageSize) ...

(Note that Memory is the execution machine's memory in megabytes, while ImageSize is in kilobytes.) When your application is threaded, the image size appears to be much larger than it really is, and you may not have a machine with sufficient memory to satisfy this requirement.

Unfortunately, this miscalculation of ImageSize is hard to fix on Linux, and we do not yet have a good solution. Fortunately, there is a workaround you can use while we work on a solution for a future release.

In the Requirements expression above, Condor added ((Memory * 1024) >= ImageSize) on your behalf. You can prevent Condor from doing this by giving it your own expression about memory in your submit file, such as:

Requirements = Memory > 1024

You will need to change 1024 to a reasonably good estimate of the memory your program actually needs, in megabytes. The expression above says that your program requires 1 gigabyte of memory. If you underestimate the memory your application needs, you may see poor performance if your job runs on machines that have insufficient memory.

In addition, if you have modified your machine policies to preempt jobs when their ImageSize gets large, you will need to change those policies.


Why does the time output from condor_status appear as [?????] ?

Condor collects timing information for a large variety of uses. Collection of the data relies on accurate times. Because Condor is a distributed system, clock skew among machines causes errant timing calculations: values can be reported too large or too small, and calculated timing values can even be negative.

This problem may be seen by the user when looking at the output of condor_status. If the ActivityTime field appears as [?????], then this calculated statistic was negative. condor_status recognizes that a negative amount of time is nonsense to report, and instead displays this string.

The solution to the problem is to synchronize the clocks on these machines. An administrator can do this with a tool such as NTP.
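
For example, with the standard NTP tools installed, an administrator can check each machine's clock offset and query a time server without changing anything (the server name is only an example):

ntpq -p
ntpdate -q pool.ntp.org

Running an NTP daemon on every machine in the pool keeps the clocks synchronized automatically.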

