


3.9 Pool Management

Condor provides a number of administrative tools to help you manage your pool. The following sections describe various tasks you might wish to perform on your pool and explain how to do them most efficiently.

All of the commands described in this section must be run from a machine listed in the HOSTALLOW_ADMINISTRATOR setting in your config files, so that the IP/host-based security allows the administrator commands to be serviced. See section 3.7.5 for full details about IP/host-based security in Condor.


3.9.1 Shutting Down and Restarting your Condor Pool

There are a couple of situations where you might want to shut down and restart your entire Condor pool. In particular, when you want to install new binaries, it is generally best to make sure no jobs are running, shut down Condor, and then install the new daemons.


3.9.1.1 Shutting Down your Condor Pool

The best way to shut down your pool is to take advantage of the remote administration capabilities of the condor_master. The first step is to save the IP address and port of the condor_master daemon on all of your machines to a file, so that even if you shut down your condor_collector, you can still send administrator commands to your different machines. You do this with the following command:

        % condor_status -master -format "%s\n" MasterIpAddr > addresses
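
Before shutting down the collector, it may be worth checking that the file really holds one address per master. The following is a minimal sketch, assuming each line has the form <IP:port>; the function name check_addresses and the pattern are illustrative and not part of Condor.

```shell
# Hypothetical helper: check that an addresses file is non-empty and that
# every line looks like <a.b.c.d:port>. The pattern is an assumption about
# the address format; adjust it if your output differs.
check_addresses() {
    file=$1
    # Fail if the file is missing or empty.
    [ -s "$file" ] || { echo "no addresses in $file" >&2; return 1; }
    # Count the lines that do NOT match the expected form.
    bad=$(grep -c -v -E '^<[0-9]+(\.[0-9]+){3}:[0-9]+>$' "$file" || true)
    if [ "$bad" -ne 0 ]; then
        echo "$bad malformed line(s) in $file" >&2
        return 1
    fi
    echo "$(wc -l < "$file" | tr -d ' ') address(es) look well formed"
}
```

Running check_addresses addresses right after creating the file would catch an empty or truncated file while the collector is still available to regenerate it.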

The next step is to shut down any currently running jobs and give them a chance to checkpoint. Depending on the size of your pool, your network infrastructure, and the image size of the standard jobs running in your pool, you may want to make this a slow process, vacating only one host at a time. You can either shut down hosts that have jobs submitted (in which case all the jobs from that host will try to checkpoint simultaneously), or you can shut down individual hosts that are running jobs. To shut down a host, simply send:

        % condor_off hostname
where "hostname" is the name of the host you want to shut down. This works only as long as your condor_collector is still running. Once you have shut down Condor on your central manager, you will have to rely on the addresses file you just created.
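
Vacating one host at a time can be scripted. The following is a rough sketch, assuming a fixed pause between hosts is enough to spread out the checkpoint traffic; the function staggered_off and the variables OFF_CMD and PAUSE are hypothetical names, not Condor settings.

```shell
# Hypothetical helper: run a shutdown command against each host in turn,
# pausing between hosts so that checkpoints do not all hit the network at once.
OFF_CMD=${OFF_CMD:-condor_off}   # command sent to each host
PAUSE=${PAUSE:-300}              # seconds to wait after each host; tune for your pool

staggered_off() {
    for host in "$@"; do
        echo "shutting down $host"
        # Report a failure but keep going with the remaining hosts.
        "$OFF_CMD" "$host" || echo "warning: $OFF_CMD failed for $host" >&2
        sleep "$PAUSE"
    done
}
```

For example, staggered_off hostA hostB would send condor_off to hostA, wait, and then move on to hostB. A suitable pause depends on the image size of your jobs and on your network, so the default here is only a placeholder.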

If all the running jobs are checkpointed and stopped, or if you're not worried about the network load caused by shutting down everything at once, it is safe to turn off all daemons on all machines in your pool. You can do this with one command, so long as you run it from a blessed administrator machine:

        % condor_off -all

condor_off shuts down all the daemons but leaves the condor_master running, so that you can send a condor_on in the future.

Once all of the Condor daemons (except the condor_master) on each host are turned off, you're done. It is now safe to install new binaries, move your checkpoint server to another host, or perform any other task that requires the pool to be shut down.

NOTE: If you are planning to install a new condor_master binary, be sure to read the following section for special considerations with this somewhat delicate task.


3.9.1.2 Installing a New condor_master

If you are going to install a new condor_master binary, there are a few other steps you should take. When the condor_master restarts, it will listen on a new port, so your addresses file will be stale. Moreover, when the master restarts, it doesn't know that you sent it a condor_off in its past life, and it will just start up all the daemons it's configured to spawn unless you explicitly tell it otherwise.

If you just want your pool to completely restart itself whenever the master notices its new binary, neither of these issues is of any concern and you can skip this (and the next) section. Just be sure the new master binary is the last thing you install. Once you put the new binary in place, the pool will restart itself over the next 5 minutes, as each master notices the new binary (by default, each master checks for one every 5 minutes).

However, if you want to have absolute control over when the rest of the daemons restart, you must take a few steps.

  1. Put the following setting in your global config file:
            START_DAEMONS = False
    
    This ensures that when the master restarts itself, it doesn't also start up the rest of its daemons.
  2. Install your new condor_master binary.
  3. Start up Condor on your central manager machine. You will have to do this manually by logging into the machine and sending commands locally. First, send a condor_restart to make sure you've got the new master, then send a condor_on to start up the other daemons (including, most importantly, the condor_collector).
  4. Wait 5 minutes, so that all the masters have a chance to notice the new binary, restart themselves, and send an update with their new address. Make sure that:
            % condor_status -master
    
    lists all the machines in your pool.
  5. Remove the special setting from your global config file.
  6. Recreate your addresses file as described above:
            % condor_status -master -format "%s\n" MasterIpAddr > addresses
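
The check in step 4 can be partly automated by comparing the masters the collector reports against a list of hosts you expect. The following is a sketch under stated assumptions: you maintain a file of expected host names, one per line, and you write the reported names to a second file, for instance with condor_status -master -format "%s\n" Name (treating Name as the attribute to print is an assumption here). The function name missing_masters is hypothetical.

```shell
# Hypothetical helper: print the hosts listed in an expected-hosts file that
# do not appear in a file of names reported by the collector.
missing_masters() {
    expected=$1   # one expected host name per line
    reported=$2   # one reported host name per line
    # -v -x -F -f: print the whole lines of $expected not found in $reported.
    # grep exits non-zero when nothing is missing, so mask that status.
    grep -v -x -F -f "$reported" "$expected" || true
}
```

Empty output means every expected master has reported back. Any names printed are machines to investigate before you remove the START_DAEMONS setting.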
    

Once the new master is in place, and you're ready to start up your pool again, you can restart your whole pool by simply following the steps in the next section.


3.9.1.3 Restarting your Condor Pool

Once you have finished whatever tasks you needed to perform and you're ready to restart your pool, you simply have to send a condor_on to the condor_master daemon on each host. You can do this with one command, so long as you run it from a blessed administrator machine:

        % condor_on `cat addresses`
That's it. All your daemons should now be restarted, and your pool will be back on its way.
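
If you would rather see which hosts fail to respond, the same addresses file can be processed one line at a time. This is a minimal sketch, assuming the command accepts a single address argument as in the example above; the function on_each and the variable ON_CMD are hypothetical names.

```shell
# Hypothetical helper: send a command to each address in a file, one at a
# time, and report how many attempts failed.
ON_CMD=${ON_CMD:-condor_on}

on_each() {
    failed=0
    while IFS= read -r addr; do
        # Skip blank lines.
        [ -n "$addr" ] || continue
        # Redirect stdin so the command cannot consume the rest of the file.
        "$ON_CMD" "$addr" < /dev/null || {
            echo "failed: $addr" >&2
            failed=$((failed + 1))
        }
    done < "$1"
    echo "$failed failure(s)"
    # Succeed only if every address was handled without error.
    [ "$failed" -eq 0 ]
}
```

Calling on_each addresses prints a summary line and returns a non-zero status if any host could not be reached, which makes it easy to retry only after investigating the failures.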


3.9.2 Reconfiguring Your Condor Pool

If you change a global config file setting and want all your machines to start using the new setting, you must send a condor_reconfig command to each host. You can do this with one command, so long as you run it from a blessed administrator machine:

        % condor_reconfig -all

NOTE: If your global config file is not shared among all your machines (using a shared filesystem), you will need to make the change to each copy of your global config file before sending the condor_reconfig.


condor-admin@cs.wisc.edu