next up previous contents index
Next: 7.5 Grid Computing Up: 7. Frequently Asked Questions Previous: 7.3 Running Condor Jobs   Contents   Index

Subsections

7.4 Condor on Windows

Will Condor work on a network of mixed Unix and NT machines?

You can have a Condor pool that consists of both Unix and NT machines.

Your central manager can be either Windows NT or Unix. For example, even if you had a pool consisting strictly of Unix machines, you could use an NT box for your central manager, and vice versa.

Submitted jobs can originate from either a Windows NT or a Unix machine, and be destined to run on Windows NT or a Unix machine. Note that there are still restrictions on the supported universes for jobs executed on Windows NT machines.

So, in summary:

  1. A single Condor pool can consist of both Windows NT and Unix machines.

  2. It does not matter at all if your Central Manager is Unix or NT.

  3. Unix machines can submit jobs to run on other Unix or Windows NT machines.

  4. Windows NT machines can submit jobs to run on other Windows NT or Unix machines.

What versions of Windows will Condor run on?

See Section 1.5, on page [*].

My Windows program works fine when executed on its own, but it does not work when submitted to Condor.

First, make sure that the program really does work outside of Condor under Windows, that the disk is not full, and that the system is not out of user resources.

As the next consideration, know that some Windows programs do not run properly because they are dynamically linked, and they cannot find the .dll files that they depend on. Version 6.4.x of Condor sets the PATH to be empty when running a job. To avoid these difficulties, do one of the following

  1. statically link the application
  2. wrap the job in a script that sets up the environment
  3. submit the job from a correctly-set environment with the command
    getenv = true
    
    in the submit description file. This will copy your environment into the job's environment.
  4. send the required .dll files along with the job using the submit description file command transfer_input_files.

Jobs submitted from NT give an error referring to a credential.

Jobs submitted from an NT machine require a stashed password in order for Condor to run the jobs as the user submitting the jobs. The command which stashes a password for a user is condor_ store_cred. See the manual page on on page [*] for usage details.

The error message that Condor gives if a user has not stashed a password is of the form:

ERROR: No credential stored for username@machinename

        Correct this by running:
	        condor_store_cred add

Jobs submitted from Unix to execute on NT do not work properly.

A difficulty with defaults causes jobs submitted from Unix for execution on an NT platform to remain in the queue, but make no progress. For jobs with this problem, log files will contain error messages pointing to shadow exceptions.

This difficulty stems from the defaults for whether file transfer takes place. The workaround for this problem is to place the line

   TRANSFER_FILES  = ALWAYS
into the submit description file for jobs submitted from a Unix machine for execution on a Windows NT machine.

When I run condor_ status I get a communication error, or the Condor daemon log files report a failure to bind.

Condor uses the first network interface it sees on your machine. This problem usually means you have an extra, inactive network interface (such as a RAS dial up interface) defined before to your regular network interface.

To solve this problem, either change the order of your network interfaces in the Control Panel, or explicitly set which network interface Condor should use by adding the following parameter to your Condor config file:

NETWORK_INTERFACE = ip-address

Where ``ip-address'' is the IP address of the interface you wish Condor to use.

My job starts but exits right away with status 128.

This can occur when the machine your job is running on is missing a DLL (Dynamically Linked Library) required by your program. The solution is to find the DLL file the program needs and put it in the TRANSFER_INPUT_FILES list in the job's submit file.

To find out what DLLs your program depends on, right-click the program in Explorer, choose Quickview, and look under ``Import List''.


Why does the startd crash on CondorNT with the error
"caInsert: Can't insert CpuBusy into target classad."?

This is a common problem with European installations on Windows. The problem is Condor expects all decimal points to be the period character (.), but the Windows locale defines them as the comma character(,). This will be fixed in the next version of Condor for NT, however we have users who have fixed the problem by changing the following registry value to a period instead of a comma: HKEY_USERS\.DEFAULT\Control Panel\International\sDecimal

How can I access network files with Condor on NT?

Features to allow Condor NT to work well with a network file server are coming very soon. However, there are a couple of work-arounds which you can do immediately with the current version of Condor NT in order to access a file server.

The heart of the problem is that on the execute machine, Condor creates a "temporary" user which will run the job... and your file server has never heard of this user before. So the workaround is to either

All of these workarounds have disadvantages, but they may be able to hold you until our code to support shared file servers in Condor is officially released.

Here are the three methods in more detail:

METHOD A - access the file server as a different user via a net use command with a login and password

Example: you want to copy a file off of a server before running it....

   @echo off
   net use \\myserver\someshare MYPASSWORD /USER:MYLOGIN
   copy \\myserver\someshare\my-program.exe
   my-program.exe

The idea here is to simply authenticate to the file server with a different login than the temporary Condor login. This is easy with the "net use" command as shown above. Of course, the obvious disadvantage is this user's password is stored and transferred as clear text.

METHOD B - access the file server as guest

Example: you want to copy a file off of a server before running it as GUEST

   @echo off
   net use \\myserver\someshare
   copy \\myserver\someshare\my-program.exe
   my-program.exe

In this example, you'd contact the server MYSERVER as the Condor temporary user. However, if you have the GUEST account enabled on MYSERVER, you will be authenticated to the server as user "GUEST". If your file permissions (ACLs) are setup so that either user GUEST (or group EVERYONE) has access the share "someshare" and the directories/files that live there, you can use this method. The downside of this method is you need to enable the GUEST account on your file server. WARNING: This should be done *with extreme caution* and only if your file server is well protected behind a firewall that blocks SMB traffic.

METHOD C - access the file server with a "NULL" descriptor

One more option is to use NULL Security Descriptors. In this way, you can specify which shares are accessible by NULL Descriptor by adding them to your registry. You can then use the batch file wrapper like:

net use z: \\myserver\someshare /USER:""
z:\my-program.exe

so long as 'someshare' is in the list of allowed NULL session shares. To edit this list, run regedit.exe and navigate to the key:

HKEY_LOCAL_MACHINE\
   SYSTEM\
     CurrentControlSet\
       Services\
         LanmanServer\
           Parameters\
             NullSessionShares

and edit it. unfortunately it is a binary value, so you'll then need to type in the hex ASCII codes to spell out your share. each share is separated by a null (0x00) and the last in the list is terminated with two nulls.

although a little more difficult to set up, this method of sharing is a relatively safe way to have one quasi-public share without opening the whole guest account. you can control specifically which shares can be accessed or not via the registry value mentioned above.

METHOD D - access with the contrib module from Bristol

Another option: some hardcore Condor users at Bristol University developed their own module for starting jobs under Condor NT to access file servers. It involves storing submitting user's passwords on a centralized server. Below I have included the README from this contrib module, which will soon appear on our website within a week or two. If you want it before that, let me know, and I could e-mail it to you.

Here is the README from the Bristol Condor NT contrib module:

README
Compilation Instructions
Build the projects in the following order

CondorCredSvc
CondorAuthSvc
Crun
Carun
AfsEncrypt
RegisterService
DeleteService
Only the first 3 need to be built in order. This just makes sure that the 
RPC stubs are correctly rebuilt if required. The last 2 are only helper 
applications to install/remove the services. All projects are Visual Studio 
6 projects. The nmakefiles have been exported for each. Only the project 
for Carun should need to be modified to change the location of the AFS 
libraries if needed.

Details
CondorCredSvc
CondorCredSvc is a simple RPC service that serves the domain account 
credentials. It reads the account name and password from the registry of 
the machine it's running on. At the moment these details are stored in 
clear text under the key

HKEY_LOCAL_MACHINE\Software\Condor\CredService

The account name and password are held in REG_SZ values "Account" and 
"Password" respectively. In addition there is an optional REG_SZ value 
"Port" which holds the clear text port number (e.g. "1234"). If this value 
is not present the service defaults to using port 3654.

At the moment there is no attempt to encrypt the username/password when it 
is sent over the wire - but this should be reasonably straightforward to 
change. This service can sit on any machine so keeping the registry entries 
secure ought to be fine. Certainly the ACL on the key could be set to only 
allow administrators and SYSTEM access.

CondorAuthSvc and Crun
These two programs do the hard work of getting the job authenticated and 
running in the right place. CondorAuthSvc actually handles the process 
creation while Crun deals with getting the winstation/desktop/working 
directory and grabbing the console output from the job so that Condor's 
output handling mechanisms still work as advertised. Probably the easiest 
way to see how the two interact is to run through the job creation process:

The first thing to realize is that condor itself only runs Crun.exe. Crun 
treats its command line parameters as the program to really run. e.g. "Crun 
\\mymachine\myshare\myjob.exe" actually causes 
\\mymachine\myshare\myjob.exe to be executed in the context of the domain 
account served by CondorCredSvc. This is how it works:

When Crun starts up it gets its window station and desktop - these are the 
ones created by condor. It also gets its current directory - again already 
created by condor. It then makes sure that SYSTEM has permission to modify 
the DACL on the window station, desktop and directory. Next it creates a 
shared memory section and copies its environment variable block into it. 
Then, so that it can get hold of STDOUT and STDERR from the job it makes 
two named pipes on the machine it's running on and attaches a thread to 
each which just prints out anything that comes in on the pipe to the 
appropriate stream. These pipes currently have a NULL DACL, but only one 
instance of each is allowed so there shouldn't be any issues involving 
malicious people putting garbage into them. The shared memory section and 
both named pipes are tagged with the ID of Crun's process in case we're on 
a multi-processor machine that might be running more than one job. Crun 
then makes an RPC call to CondorAuthSvc to actually start the job, passing 
the names of the window station, desktop, executable to run, current 
directory, pipes and shared memory section (it only attempts to call 
CondorAuthSvc on the same machine as it is running on). If the jobs starts 
successfully it gets the process ID back from the RPC call and then just 
waits for the new process to finish before closing the pipes and exiting. 
Technically, it does this by synchronizing on a handle to the process and 
waiting for it to exit. CondorAuthSvc sets the ACL on the process to allow 
EVERYONE  to synchronize on it.

[ Technical note: Crun adds "C:\WINNT\SYSTEM32\CMD.EXE /C" to the start of 
the command line. This is because the process is created with the network 
context of the caller i.e. LOCALSYSTEM. Prepending cmd.exe gets round any 
unexpected "Access Denied" errors. ]

If Crun gets a WM_CLOSE (CTRL_CLOSE_EVENT) while the job is running it 
attempts to stop the job, again with an RPC call to CondorAuthSvc passing 
the job's process ID.

CondorAuthSvc runs as a service under the LOCALSYSTEM account and does the 
work of starting the job. By default it listens on port 3655, but this can 
be changed by setting the optional REG_SZ value "Port" under the registry key

HKEY_LOCAL_MACHINE\Software\Condor\AuthService

(Crun also checks this registry key when attempting to contact 
CondorAuthSvc.) When it gets the RPC to start a job CondorAuthSvc first 
connects to the pipes for STDOUT and STDERR to prevent anyone else sending 
data to them. It also opens the shared memory section with the environment 
stored by Crun.  It then makes an RPC call to CondorCredSvc (to get the 
name and password of the domain account) which is most likely running on 
another system. The location information is stored in the registry under 
the key

HKEY_LOCAL_MACHINE\Software\Condor\CredService

The name of the machine running CondorCredSvc must be held in the REG_SZ 
value "Host". This should be the fully qualified domain name of the 
machine. You can also specify the optional "Port" REG_SZ value in case you 
are running CondorCredSvc on a different port.

Once the domain account credentials have been received the account is 
logged on through a call to LogonUser. The DACLs on the window station, 
desktop and current directory are then modified to allow the domain account 
access to them and the job is started in that window station and desktop 
with a call to CreateProcessAsUser. The starting directory is set to the 
same as sent by Crun, STDOUT and STDERR handles are set to the named pipes 
and the environment sent by Crun is used. CondorAuthSvc also starts a 
thread which waits on the new process handle until it terminates to close 
the named pipes. If the process starts correctly the process ID is returned 
to Crun.

If Crun requests that the job be stopped (again via RPC), CondorAuthSvc 
loops over all windows on the window station and desktop specified until it 
finds the one associated with the required process ID. It then sends that 
window a WM_CLOSE message, so any termination handling built in to the job 
should work correctly.

[Security Note: CondorAuthSvc currently makes no attempt to verify the 
origin of the call starting the job. This is, in principal, a bad thing 
since if the format of the RPC call is known it could let anyone start a 
job on the machine in the context of the domain user. If sensible security 
practices have been followed and the ACLs on sensitive system directories 
(such as C:\WINNT) do not allow write access to anyone other than trusted 
users the problem should not be too serious.]

Carun and AFSEncrypt
Carun and AFSEncrypt are a couple of utilities to allow jobs to access AFS 
without any special recompilation. AFSEncrypt encrypts an AFS 
username/password into a file (called .afs.xxx) using a simple XOR 
algorithm. It's not a particularly secure way to do it, but it's simple and 
self-inverse. Carun reads this file and gets an AFS token before running 
whatever job is on its command line as a child process. It waits on the 
process handle and a 24 hour timer. If the timer expires first it briefly 
suspends the primary thread of the child process and attempts to get a new 
AFS token before restarting the job, the idea being that the job should 
have uninterrupted access to AFS if it runs for more than 25 hours (the 
default token lifetime). As a security measure, the AFS credentials are 
cached by Carun in memory and the .afs.xxx file deleted as soon as the 
username/password have been read for the first time.

Carun needs the machine to be running either the IBM AFS client or the 
OpenAFS client to work. It also needs the client libraries if you want to 
rebuild it.

For example, if you wanted to get a list of your AFS tokens under Condor 
you would run the following:

Crun \\mymachine\myshare\Carun tokens.exe

Running a job
To run a job using this mechanism specify the following in your job 
submission (assuming Crun is in C:\CondorAuth):

Executable= c:\CondorAuth\Crun.exe
Arguments = \\mymachine\myshare\carun.exe 
\\anothermachine\anothershare\myjob.exe
Transfer_Input_Files = .afs.xxx

along with your usual settings.

Installation
A basic installation script for use with the Inno Setup installation 
package compiler can be found in the Install folder.

What is wrong when condor_ off cannot find my host, and condor_ status does not give me a complete host name?

Given the command

  condor_off hostname2
an error message of the form
  Can't find address for master hostname2.somewhere.edu
appears. Yet, when looking at the host names with
  condor_status -master
the output is of the form
  hostname1.somewhere.edu
  hostname2
  hostname3.somewhere.edu

To correct this incomplete host name, add an entry to the configuration file for DEFAULT_DOMAIN_NAME that specifies the domain name to be used. For the example given, the configuration entry will be

  DEFAULT_DOMAIN_NAME = somewhere.edu

After adding this configuration file entry, use condor_ restart to restart the Condor daemons and effect the change.


next up previous contents index
Next: 7.5 Grid Computing Up: 7. Frequently Asked Questions Previous: 7.3 Running Condor Jobs   Contents   Index
condor-admin@cs.wisc.edu