Running Condor

Last Modified: 11/24/2011, Condor 2.7.4, Ubuntu 10.04 LTS

Brief Overview

From Wikipedia: "Condor is an open source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks." It seems to shine in situations where the resources (i.e. the computers) are not always available. The classic example is the desktop that is used during the day but available at night (or during an off-period in a lab). This is not my personal need: my resources are always available, so I won't deal with those configuration options here. Also, I'm pretty lax on security, as my environment is my own and not reachable from outside my network.

I set up Condor on a couple of machines and ran some jobs. Note that there is a difference between "Personal Condor" and Condor. Personal Condor allows everything to run standalone, I assume for the purposes of testing out Condor. As far as I can see, the two differ solely in configuration (and in the resources available, of course); the binaries included are the same. Much of the pain I went through was due to the transition from one to the other. Another interesting thing to note is that they don't distribute their source code.

And, at the risk of stating the obvious: this is a distributed-computing framework, so resources are expected to be available, at least at runtime. Condor has ways of shipping files when it runs and even making sure it runs on the right OSes or architectures, which I won't cover here. However, I will say that if you're reading and writing files, you need to check a few things on every machine:

  • Do you have a NAS or some distributed filesystem? (BTW, Condor can talk to HDFS.)
  • Is the executable on the machine, if not delivered by Condor?
  • Are the directories/reference files available?
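
The checklist above can be turned into a quick pre-flight check to run on each execute node. This is only a sketch: check_job_inputs is a hypothetical helper of mine, not something that ships with Condor, and the paths you pass in are whatever your own job needs.

```shell
# Hypothetical pre-flight helper (not part of Condor).
# Usage: check_job_inputs /path/to/executable /data/dir [/more/paths ...]
check_job_inputs() {
    exe=$1
    shift
    rc=0
    # The executable must exist and be runnable on this machine
    # (unless Condor is delivering it for you).
    if [ ! -x "$exe" ]; then
        echo "missing or not executable: $exe" >&2
        rc=1
    fi
    # Every directory or reference file the job touches must be reachable.
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing path: $path" >&2
            rc=1
        fi
    done
    return $rc
}
```

It returns non-zero if anything is missing, so it can gate a submit script.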

    Installation

    Simple. On each machine run:

    sudo apt-get install condor

    See that it's running with:

    ps aux |grep condor_

    Starting Condor Daemons

    sudo service condor start

    Stopping/Restarting Condor: I've found that the stop script doesn't work, so I've been doing this, YMMV:

    ps aux | grep '[c]ondor_' | awk '{ print $2 }' | xargs sudo kill -9

    Configuration

    Concepts

    The machines in a Condor Pool have different roles, and each role determines which daemons should run on that machine. Here's a very simple (and flip) description so you get the idea:

  • Central Manager: this is the head node that doles out the jobs
  • Execute Node: this will actually run the code
  • Submit Node: jobs can be submitted from this node. (I don't worry about this too much here, in this example, everything is a submit node)

    Furthermore, machines have to be explicitly configured (or security opened up, which is what I did in this case) to talk to each other. See the reference section below for details on the daemons.
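
To make the role-to-daemon idea concrete, here is a rough sketch of which DAEMON_LIST value goes with which role, as I understand it. The role labels and the daemon_list_for_role function are made up for illustration; only the daemon names are Condor's.

```shell
# Sketch: print a plausible DAEMON_LIST value for a role.
# The role labels are hypothetical names of mine, not Condor terms.
daemon_list_for_role() {
    case "$1" in
        central-manager) echo "MASTER, COLLECTOR, NEGOTIATOR" ;;
        execute)         echo "MASTER, STARTD" ;;
        submit)          echo "MASTER, SCHEDD" ;;
        # An execute node that can also submit jobs, as in this post.
        execute+submit)  echo "MASTER, STARTD, SCHEDD" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

Note that the stock Ubuntu config lists all five daemons (MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR), which is what lets a single machine act as a complete pool.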

    Central Manager Configuration

    See my diff against the stock config file (/etc/condor/condor_config). You can also do config.local for configuration overrides, which I didn't do in this case. Again, note that security is wide open in my case:

    supertom@hadoop-1:~/code/condor$ diff -ru ~/condor_config.sav /etc/condor/condor_config
    --- /home/supertom/condor_config.sav    2011-11-23 10:06:09.729551664 -0800
    +++ /etc/condor/condor_config   2011-11-23 08:45:13.999343070 -0800
     ##  Pathnames:
    @@ -94,7 +97,8 @@
     ##  If your machines don't use a network file system, set it to
     ##  FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
     ##  to specify that each machine has its own file system. 
    -FILESYSTEM_DOMAIN      = $(FULL_HOSTNAME)
    +#FILESYSTEM_DOMAIN     = $(FULL_HOSTNAME)
    +FILESYSTEM_DOMAIN      = mydomain.com
     
     ##  This macro is used to specify a short description of your pool. 
     ##  It should be about 20 characters long. For example, the name of 
    @@ -144,7 +148,7 @@
     ##  to flock to. (i.e. you are specifying the machines that you
     ##  want your jobs to be negotiated at -- thereby specifying the
     ##  pools they will run in.)
    -FLOCK_TO = 
    +FLOCK_TO = myhost.mydomain.com
     ##  An example of this is:
     #FLOCK_TO = central_manager.friendly.domain, condor.cs.wisc.edu
     
    @@ -172,7 +176,8 @@
     ##  machine(s) where whoever is the condor administrator(s) works
     ##  (assuming you trust all the users who log into that/those
     ##  machine(s), since this is machine-wide access you're granting).
    -HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
    +#HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
    +HOSTALLOW_ADMINISTRATOR = *
     
     ##  If there are no machines that should have administrative access 
     ##  to your pool (for example, there's no machine where only trusted
    @@ -186,7 +191,8 @@
     ##  issue to their own machine (like condor_vacate).  This defaults to
     ##  machines with administrator access, and the local machine.  This
     ##  is probably what you want.
    -HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
    +#HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
    +HOSTALLOW_OWNER = *
     
     ##  Read access.  Machines listed as allow (and/or not listed as deny)
     ##  can view the status of your pool, but cannot join your pool 
    @@ -198,7 +204,7 @@
     ##  will be able to view the status of your pool and more easily help
     ##  you install, configure or debug your Condor installation.
     ##  It is important to have this defined.
    -HOSTALLOW_READ = $(FULL_HOSTNAME)
    +HOSTALLOW_READ = *
     #HOSTALLOW_READ = *.your.domain, *.cs.wisc.edu
     #HOSTDENY_READ = *.bad.subnet, bad-machine.your.domain, 144.77.88.*
     
    @@ -213,17 +219,19 @@
     ##    HOSTALLOW_WRITE = *
     ##  but note that this will allow anyone to submit jobs or add
     ##  machines to your pool and is serious security risk.
    -HOSTALLOW_WRITE = $(FULL_HOSTNAME)
    +HOSTALLOW_WRITE = *
     #HOSTALLOW_WRITE = *.your.domain, your-friend's-machine.other.domain
     #HOSTDENY_WRITE = bad-machine.your.domain
     
     ##  Negotiator access.  Machines listed here are trusted central
     ##  managers.  You should normally not have to change this.
    -HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)
    +#HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)
    +HOSTALLOW_NEGOTIATOR = *
     ##  Now, with flocking we need to let the SCHEDD trust the other 
     ##  negotiators we are flocking with as well.  You should normally 
     ##  not have to change this either.
    -HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
    +#HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
    +HOSTALLOW_NEGOTIATOR_SCHEDD = *
    

    Execute Node Configuration

    Same config file (/etc/condor/condor_config), different machine, so the config is slightly different. Notice the DAEMON_LIST variable in particular. Also, CONDOR_HOST needs to point to your Central Manager host above. And finally, FILESYSTEM_DOMAIN must match across machines if you want to distribute workloads across them.

    supertom@hadoop-5:~$ diff /etc/condor/condor_config.sav /etc/condor/condor_config
    50c50,54
    < CONDOR_HOST   = $(FULL_HOSTNAME)
    ---
    > #CONDOR_HOST  = $(FULL_HOSTNAME)
    > CONDOR_HOST   = myhost.mydomain.com
    > #CONDOR_HOST  = 192.168.157.10
    > STARTER_ALLOW_RUNAS_OWNER = True
    91c95,96
    < UID_DOMAIN            = $(FULL_HOSTNAME)
    ---
    > #  UID_DOMAIN = $(FULL_HOSTNAME)
    > UID_DOMAIN            = mydomain.com
    97c102,103
    < FILESYSTEM_DOMAIN     = $(FULL_HOSTNAME)
    ---
    > #FILESYSTEM_DOMAIN    = $(FULL_HOSTNAME)
    > FILESYSTEM_DOMAIN     = mydomain.com
    102c108
    < COLLECTOR_NAME                = Ubuntu Default Personal Pool
    ---
    > COLLECTOR_NAME                = My Pool
    175c181,182
    < HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
    ---
    > #HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
    > HOSTALLOW_ADMINISTRATOR = *
    189c196,197
    < HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
    ---
    > #HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
    > HOSTALLOW_OWNER = *
    201c209
    < HOSTALLOW_READ = $(FULL_HOSTNAME)
    ---
    > HOSTALLOW_READ = *
    216c224
    < HOSTALLOW_WRITE = $(FULL_HOSTNAME)
    ---
    > HOSTALLOW_WRITE = *
    226c234,235
    < HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
    ---
    > #HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
    > HOSTALLOW_NEGOTIATOR_SCHEDD = *
    373c382
    < #DEFAULT_DOMAIN_NAME = your.domain.name
    ---
    > DEFAULT_DOMAIN_NAME = mydomain.com
    430a440
    > TRUST_UID_DOMAIN = True
    1006c1016,1017
    < DAEMON_LIST                   = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
    ---
    > #DAEMON_LIST                  = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
    > DAEMON_LIST                   = MASTER, STARTD, SCHEDD
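
Since mismatched values are easy to miss, here is a small sketch for pulling a single setting out of the config file so you can compare it across machines. config_value is a hypothetical helper of mine. It assumes plain "KEY = value" lines, does no macro expansion, ignores any local override file, and assumes a later assignment overrides an earlier one.

```shell
# Hypothetical helper: print the last uncommented assignment of KEY in FILE.
# Assumes simple "KEY = value" lines; $(MACRO) references are printed unexpanded.
config_value() {
    file=$1
    key=$2
    # Lines starting with '#' never match, so commented-out settings are skipped.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example (run on each machine and compare the output):
#   config_value /etc/condor/condor_config FILESYSTEM_DOMAIN
#   config_value /etc/condor/condor_config CONDOR_HOST
```

If the daemons are already running, condor_config_val FILESYSTEM_DOMAIN should report the value Condor actually sees.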
    

    Usage

    I'm going to be pretty light on usage for now, as there is lots of info on it. If you've gotten everything working up to this point with multiple nodes, you're in good shape. Still, let's make sure your installation really works. Since I'm lazy, I suggest following the tutorial here to create your sample program and job file.

    Job Config File

    Create this file and call it "myjob.submit". This assumes the simple.c program from above, and that the compiled executable is available on every machine.

    Universe   = vanilla
    Executable = simple
    Arguments  = 4 10
    Log        = simple.log
    Output     = simple.out
    Error      = simple.error
    Queue      60
    

    Make the Queue number greater than the number of slots (essentially cores) that you have available. I would tune the params of the simple.c program (from the tutorial linked above) to create a longer pause so you can see what's going on inside Condor.
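
If you want to play with the Queue number, a sketch like the following generates the job file for you. write_submit_file is a hypothetical helper of mine. Using $(Process) in the Output and Error names is my own tweak, so that the queued jobs don't all write to the same files; drop it if you want the file exactly as shown above.

```shell
# Hypothetical helper: write a vanilla-universe submit file with N queued jobs.
# Usage: write_submit_file myjob.submit 60
write_submit_file() {
    out=$1
    count=$2
    # \$(Process) is written literally; Condor expands it to the job's index.
    cat > "$out" <<EOF
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.\$(Process).out
Error      = simple.\$(Process).error
Queue      $count
EOF
}
```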

    Submitting a Job

    condor_submit myjob.submit

    See the utilities below to learn about what happened:

    Utilities

  • condor_status - see the status of the pool, including what machines are online and their status
  • condor_q - see the queue of jobs being processed
  • condor_q -analyze - find the reason why jobs might be in the queue waiting (IDLE, HELD)
  • condor_rm - remove a job from the queue
  • condor_submit - send a job to condor
  • condor_config_val -dump: dump the configuration of the running daemons
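
Besides these tools, the job's own log file (simple.log in the job file above) records what happened. Here is a rough sketch that tallies it, assuming the plain-text user log format in which each event starts with a three-digit code (000 submitted, 001 executing, 005 terminated). job_log_summary is a hypothetical helper, not a Condor command.

```shell
# Hypothetical helper: count submit/execute/terminate events in a job log.
# Assumes each event line starts with a three-digit event code.
job_log_summary() {
    log=$1
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    submitted=$(grep -c '^000 ' "$log" || true)
    started=$(grep -c '^001 ' "$log" || true)
    terminated=$(grep -c '^005 ' "$log" || true)
    echo "submitted=$submitted started=$started terminated=$terminated"
}

# Example: job_log_summary simple.log
```

When the terminated count reaches the Queue number, the whole batch is done.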

    Important Files

  • Configuration: /etc/condor/condor_config
  • Logs: /var/lib/condor/log

    Troubleshooting

    Each daemon has a log in /var/lib/condor/log, for example:

  • CollectorLog
  • NegotiatorLog

    There are a few others that get created, for example:

  • MatchLog: shows how a Job and a Machine were matched up
  • ShadowLog: details about how the job was submitted

    These logs are your friends.
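
When something is stuck, I find it handy to skim all the daemon logs at once. A minimal sketch, assuming the log directory mentioned above: scan_condor_logs is a hypothetical helper, and the search words are just guesses at what is worth looking for.

```shell
# Hypothetical helper: show the last few suspicious lines from each daemon log.
# Usage: scan_condor_logs [logdir]   (defaults to /var/lib/condor/log)
scan_condor_logs() {
    logdir=${1:-/var/lib/condor/log}
    for f in "$logdir"/*Log; do
        # Skip the unexpanded pattern if no log files exist.
        [ -f "$f" ] || continue
        echo "== $f"
        # The patterns are guesses; adjust them to whatever you are chasing.
        grep -i -E 'error|fail|denied' "$f" | tail -n 5
    done
}
```

Run it on both the Central Manager and the execute node, since each machine only has the logs for the daemons it runs.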

    References

  • Condor Project Homepage
  • Wikipedia (Overview)
  • Daemon Description
  • Mailing List Archives
  • Condor Tutorial