From Wikipedia: "Condor is an open source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks." It seems to shine in situations where the resources (i.e. the computers) are not always available. The classic example is the desktop that is used during the day but available at night (or during an off-period in a lab). This is not my personal need; my resources are always available, so I won't deal with those configuration options here. Also, I'm pretty lax on security, as my environment is my own and not reachable from outside my network.
I set up Condor on a couple of machines and ran some jobs. Note that there is a difference between "Personal Condor" and Condor. Personal Condor allows everything to run standalone, I assume for the purposes of testing out Condor. As far as I can see, the two differ only in configuration (and, of course, in the resources available); the binaries included are the same. Much of the pain I went through was due to the transition from one to the other. Another interesting thing to note is that they don't distribute their source code.
And, at the risk of stating the obvious: This is a distributed-computing framework, so resources are expected to be available, at least at runtime. Condor has ways of shipping files when it runs and even making sure it runs on the right OSes or Architectures, which I won't cover here. However, I will say that if you're reading and writing files, you need to be sure that you can:
Simple. On each machine run:
sudo apt-get install condor
See that it's running with:
ps aux |grep condor_
Starting Condor Daemons
sudo service condor start
Stopping/Restarting Condor: I've found that the stop script doesn't work, so I've been doing this, YMMV:
ps aux | grep '[c]ondor_' | awk '{ print $2 }' | xargs sudo kill -9
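A slightly gentler variant of the kill pipeline above, as a minimal sketch. It assumes the standard `ps aux` column layout, and the helper names `condor_pids` and `stop_condor` are made up for this example. Trying SIGTERM before SIGKILL is general Unix habit rather than something I tested against Condor specifically.

```shell
#!/bin/sh
# Print the PIDs of Condor daemons found in `ps aux`-style text on stdin.
# In that layout, field 2 is the PID and field 11 starts the command line,
# so the `grep condor_` process itself is never matched.
condor_pids() {
    awk '$11 ~ /(^|\/)condor_[a-z_]+$/ { print $2 }'
}

# Hypothetical helper: ask the daemons to exit, wait, then force the rest.
# Not called here; it needs root and GNU xargs (for -r).
stop_condor() {
    ps aux | condor_pids | xargs -r sudo kill -TERM
    sleep "${1:-10}"
    ps aux | condor_pids | xargs -r sudo kill -KILL
}
```

Running `ps aux | condor_pids` on its own is a harmless way to see what would be killed before calling `stop_condor`.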
The machines in a Condor Pool have different roles, and the role determines which daemons a machine should run. Here's a very simple (and flip) description so you get the idea:
Furthermore, machines have to be explicitly configured (or security opened up, which is what I did in this case) to talk to each other. See the reference section below for details on the daemons.
See my diff against the stock config file (/etc/condor/condor_config). You can also put configuration overrides in config.local, which I didn't do in this case. Again, note that security is wide open in my case:
supertom@hadoop-1:~/code/condor$ diff -ru ~/condor_config.sav /etc/condor/condor_config
--- /home/supertom/condor_config.sav 2011-11-23 10:06:09.729551664 -0800
+++ /etc/condor/condor_config 2011-11-23 08:45:13.999343070 -0800
## Pathnames:
@@ -94,7 +97,8 @@
## If your machines don't use a network file system, set it to
## FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
## to specify that each machine has its own file system.
-FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
+#FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
+FILESYSTEM_DOMAIN = mydomain.com
## This macro is used to specify a short description of your pool.
## It should be about 20 characters long. For example, the name of
@@ -144,7 +148,7 @@
## to flock to. (i.e. you are specifying the machines that you
## want your jobs to be negotiated at -- thereby specifying the
## pools they will run in.)
-FLOCK_TO =
+FLOCK_TO = myhost.mydomain.com
## An example of this is:
#FLOCK_TO = central_manager.friendly.domain, condor.cs.wisc.edu
@@ -172,7 +176,8 @@
## machine(s) where whoever is the condor administrator(s) works
## (assuming you trust all the users who log into that/those
## machine(s), since this is machine-wide access you're granting).
-HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
+#HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
+HOSTALLOW_ADMINISTRATOR = *
## If there are no machines that should have administrative access
## to your pool (for example, there's no machine where only trusted
@@ -186,7 +191,8 @@
## issue to their own machine (like condor_vacate). This defaults to
## machines with administrator access, and the local machine. This
## is probably what you want.
-HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
+#HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
+HOSTALLOW_OWNER = *
## Read access. Machines listed as allow (and/or not listed as deny)
## can view the status of your pool, but cannot join your pool
@@ -198,7 +204,7 @@
## will be able to view the status of your pool and more easily help
## you install, configure or debug your Condor installation.
## It is important to have this defined.
-HOSTALLOW_READ = $(FULL_HOSTNAME)
+HOSTALLOW_READ = *
#HOSTALLOW_READ = *.your.domain, *.cs.wisc.edu
#HOSTDENY_READ = *.bad.subnet, bad-machine.your.domain, 144.77.88.*
@@ -213,17 +219,19 @@
## HOSTALLOW_WRITE = *
## but note that this will allow anyone to submit jobs or add
## machines to your pool and is serious security risk.
-HOSTALLOW_WRITE = $(FULL_HOSTNAME)
+HOSTALLOW_WRITE = *
#HOSTALLOW_WRITE = *.your.domain, your-friend's-machine.other.domain
#HOSTDENY_WRITE = bad-machine.your.domain
## Negotiator access. Machines listed here are trusted central
## managers. You should normally not have to change this.
-HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)
+#HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)
+HOSTALLOW_NEGOTIATOR = *
## Now, with flocking we need to let the SCHEDD trust the other
## negotiators we are flocking with as well. You should normally
## not have to change this either.
-HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
+#HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
+HOSTALLOW_NEGOTIATOR_SCHEDD = *
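For anyone who would rather leave the stock file alone, the same changes from the diff above could go into a local override file instead. This is only a sketch: I didn't run my pool this way, the output path is a placeholder, and the real location is whatever LOCAL_CONFIG_FILE is set to in your condor_config. The values are copied from the diff; `OUT` is a variable invented for this example.

```shell
#!/bin/sh
# Write the central-manager overrides from the diff into a separate file.
# OUT is a placeholder; copy the result to wherever LOCAL_CONFIG_FILE
# points on your install.
OUT="${OUT:-./condor_config.local.example}"

cat > "$OUT" <<'EOF'
FILESYSTEM_DOMAIN = mydomain.com
FLOCK_TO = myhost.mydomain.com

# Wide open, as in the diff; only reasonable on a private network.
HOSTALLOW_ADMINISTRATOR = *
HOSTALLOW_OWNER = *
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
HOSTALLOW_NEGOTIATOR = *
HOSTALLOW_NEGOTIATOR_SCHEDD = *
EOF
echo "wrote $OUT"
```

Keeping overrides in one small file also makes it easier to see at a glance how far a machine has drifted from the packaged defaults.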
Same config file (/etc/condor/condor_config), different machine, so the config is slightly different. Notice the DAEMON_LIST var in particular. Also, CONDOR_HOST needs to point to your Central Manager host above. And finally, the FILESYSTEM_DOMAIN var must match across machines if you want to distribute workloads across them.
supertom@hadoop-5:~$ diff /etc/condor/condor_config.sav /etc/condor/condor_config
50c50,54
< CONDOR_HOST = $(FULL_HOSTNAME)
---
> #CONDOR_HOST = $(FULL_HOSTNAME)
> CONDOR_HOST = myhost.mydomain.com
> #CONDOR_HOST = 192.168.157.10
> STARTER_ALLOW_RUNAS_OWNER = True
91c95,96
< UID_DOMAIN = $(FULL_HOSTNAME)
---
> # UID_DOMAIN = $(FULL_HOSTNAME)
> UID_DOMAIN = mydomain.com
97c102,103
< FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
---
> #FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
> FILESYSTEM_DOMAIN = mydomain.com
102c108
< COLLECTOR_NAME = Ubuntu Default Personal Pool
---
> COLLECTOR_NAME = My Pool
175c181,182
< HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
---
> #HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
> HOSTALLOW_ADMINISTRATOR = *
189c196,197
< HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
---
> #HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
> HOSTALLOW_OWNER = *
201c209
< HOSTALLOW_READ = $(FULL_HOSTNAME)
---
> HOSTALLOW_READ = *
216c224
< HOSTALLOW_WRITE = $(FULL_HOSTNAME)
---
> HOSTALLOW_WRITE = *
226c234,235
< HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
---
> #HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
> HOSTALLOW_NEGOTIATOR_SCHEDD = *
373c382
< #DEFAULT_DOMAIN_NAME = your.domain.name
---
> DEFAULT_DOMAIN_NAME = mydomain.com
430a440
> TRUST_UID_DOMAIN = True
1006c1016,1017
< DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
---
> #DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
> DAEMON_LIST = MASTER, STARTD, SCHEDD
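Since a mismatched FILESYSTEM_DOMAIN (or a wrong CONDOR_HOST) is easy to miss, here is a rough sketch for pulling the effective value of a setting out of a config file so it can be compared between machines. The function name `config_value` is made up for this example, and it only does a plain text scan: it does not expand macros like $(FULL_HOSTNAME) or follow local config files. Condor's own condor_config_val tool is probably the better choice on a live node.

```shell
#!/bin/sh
# Print the last uncommented value assigned to KEY in a Condor config FILE.
# Plain text scan only: macros such as $(FULL_HOSTNAME) are not expanded.
config_value() {
    key="$1"; file="$2"
    awk -F '=' -v key="$key" '
        /^[ \t]*#/ { next }                      # skip commented-out lines
        {
            name = $1
            gsub(/[ \t]/, "", name)              # "KEY " -> "KEY"
            if (name == key) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                last = value                     # later assignments win
            }
        }
        END { if (last != "") print last }
    ' "$file"
}

# Usage sketch (file names are hypothetical copies of each node's config):
#   config_value FILESYSTEM_DOMAIN hadoop-1.condor_config
#   config_value FILESYSTEM_DOMAIN hadoop-5.condor_config
```

If the two commands print different values, jobs submitted on one machine are unlikely to be matched to the other without file transfer settings.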
I'm going to be pretty light here on usage for now, as there is lots of info on it. If you've gotten everything working up to this point with multiple nodes, you're in good shape. Still, let's make sure your installation really works. Since I'm lazy, I suggest following the tutorial here to create your sample program and job file.
Create this file and call it "job.submit". This assumes the simple.c program from above, and we assume that the compiled binary is available on every machine.
Universe = vanilla
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
Queue 60
Make the Queue number greater than the number of slots (essentially cores) that you have available. I would tune the params of the simple.c program (from the tutorial linked above) to create a longer pause so you can see what's going on inside Condor.
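If you don't want to hand-edit the Queue line every time the pool changes, something like the following could generate the submit file. It is a sketch only: `SLOTS` is a number you supply yourself (for example by counting lines in condor_status output), the default of 8 is a placeholder, and doubling it is an arbitrary choice that just guarantees more jobs than slots.

```shell
#!/bin/sh
# Write job.submit with more queued jobs than there are slots.
# SLOTS is supplied by you; 8 is only a placeholder default, and
# doubling it is an arbitrary choice for this sketch.
SLOTS="${SLOTS:-8}"
JOBS=$((SLOTS * 2))

cat > job.submit <<EOF
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue $JOBS
EOF
echo "job.submit queues $JOBS jobs for $SLOTS slots"
```

For example, `SLOTS=30 sh make_submit.sh` (the script name is hypothetical) would reproduce the `Queue 60` used above.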
condor_submit job.submit
See the utilities below to learn about what happened: