Condor Cluster

UTCS operates a general-purpose High Throughput Computing (HTC) cluster, Mastodon, which is managed using the Condor job scheduling software. In addition to a large number of dedicated compute nodes, the cluster also makes use of idle desktop machines. Mastodon is available to the faculty, staff, and students of the department for research and educational purposes. At this point, the cluster is used almost exclusively for running batch jobs, but Condor also supports parallel jobs using MPI. The hardware is not optimal for very heavy parallel computing because it uses gigabit Ethernet for interconnects rather than proprietary low-latency technologies. It does, however, have its own dedicated enterprise-class switch, and some of our users have done MPI work on it.

A subset of the Mastodon cluster has been set aside to allow direct logins for jobs which do not work under Condor. If you find that you cannot make your jobs work under Condor, please send email to gripe@cs, and we will either help you solve the Condor problem or arrange for login access to this group of semi-public machines. All users on these machines have to work together to avoid resource conflicts, as there is no provision for automatic resource management and no way for staff to sort out who should be using what machines at what time.

Mastodon now has its own dedicated Network Appliance Filer with about 12TB of usable space on it. If you intend to run jobs which are I/O intensive or which require large amounts of data, please write to gripe@cs to get a directory on this server. The scratch space has no quotas, but it is not backed up, so it is not intended for long-term storage of important data. This storage is not available outside the cluster and the job submission nodes, submit32 and submit64.

Mastodon hardware configuration

The Mastodon cluster consists of the following hardware:

Host(s) | Function | Server Model | Processor | Memory
melkor | central manager | Dell PowerEdge 1950 | 2x E5440 Xeon (quad-core) @ 2.8GHz | 32GB
switch57 | cluster backbone switch | Cisco Catalyst 6509 | - | -
filer4a | file storage | Network Appliance FAS3050c | - | 24TB
submit32, submit64 | dedicated submit nodes | Dell PowerEdge R410 | 2x E5620 Xeon (quad-core) @ 2.40GHz | 24GB
angrist | compute node x 5: angrist-N | Dell PowerEdge M620 | 2x Xeon E5-2670 (8 core) @ 2.60GHz | 128GB
eldar | GeForce GTX Titan Black GPU node x 10: eldar-N | Dell PowerEdge M720 | 2x Xeon E5-2603 (4 core) @ 1.80GHz | 8GB
glamdring | compute node x 14: glamdring-N | Dell PowerEdge M910 | 2x Xeon E7-2870 (10 core) @ 2.40GHz | 512GB
gundabad | compute node x 2: gundabad-N | Dell PowerEdge M610 | 2x Xeon X5690 (12 core) @ 3.47GHz | 96GB
narsil | compute node x 13: narsil-N | Dell PowerEdge M710 | 2x Xeon X5675 (6 core) @ 3.06GHz | 96GB
nauro | compute node x 64: nauro-N | Dell PowerEdge 1950 | 2x Opteron 2218 (dual-core) @ 2.6GHz | 16GB
oliphaunt-0 | compute node x 1 | Dell PowerEdge R905 | 4x Opteron 8435 (6 core) @ 2.6GHz | 256GB
oliphaunt-1 and -2 | compute node x 2 | Dell PowerEdge R720 | 2x Xeon E5-2665 (8 core) @ 2.4GHz | 768GB
orcrist | compute node x 37: orcrist-N | Dell PowerEdge M610 | 2x E5530 Xeon (quad-core) @ 2.4GHz | 32GB
rhavan | compute node x 77: rhavan-N | Dell PowerEdge 1950 | 2x Xeon X5355 (quad-core) @ 2.66GHz | 32GB
uvanimor | compute node x 21: uvanimor-N | Dell PowerEdge 1950 | 2x Xeon X5440 (quad-core) @ 2.83GHz | 32GB

About Condor

What is Condor?

From the Overview of the Condor Manual:

Instead of running a CPU-intensive job in the background on their own workstation, a user submits their job to Condor. Condor finds an available machine on the network and begins running the job on that machine. When Condor detects that a machine running a Condor job is no longer available (perhaps because the owner of the machine came back from lunch and started typing on the keyboard), Condor checkpoints the job and migrates it over the network to a different machine which would otherwise be idle. Condor restarts the job on the new machine to continue from precisely where it left off. If no machine on the network is currently available, then the job is stored in a queue on disk until a machine becomes available.

As previously mentioned, the UTCS Condor pool consists of dedicated Linux x86 servers and various Linux x86 desktop machines. Most of the staff-supported machines in the department run some subset of the Condor daemons so that they can at least submit jobs and query the status of the pool. Many of them also run the job execution daemons so that they contribute to the overall computing capacity when they are not being used by their owners.

For more information on using Condor, see the Condor User's Manual section of the Condor Manual.

Getting Started with Condor

All of the Condor programs are in /lusr/opt/condor/bin; add that directory to your PATH environment variable or use that path with the commands described below. To use condor_compile, you do need that directory in your path.
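One way to do this for the current shell session is sketched below. It assumes a Bourne-style shell (sh or bash); the variable name CONDOR_BIN and the check that avoids listing the directory twice are illustrative conveniences, not requirements.

```shell
# Directory that holds the Condor programs (from the text above).
CONDOR_BIN=/lusr/opt/condor/bin

# Prepend it to PATH unless it is already listed.
case ":$PATH:" in
  *":$CONDOR_BIN:"*) ;;                 # already present, nothing to do
  *) PATH="$CONDOR_BIN:$PATH" ;;
esac
export PATH
```

To make the change persistent, the same lines can be placed in a shell startup file such as ~/.profile.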

Job Submission Overview

Note: The machine aliased as "linux.cs", currently diligence, does not run the Condor daemons, nor do the headless camera, virtue and sin machines.

  1. Optional: Use condor_compile to compile and link the program with Condor's libraries so that the job can be checkpointed and migrated among machines.
  2. Submit the job with condor_submit.
  3. The job will enter the queue and be executed. To monitor it, use condor_q.
  4. To check the status of the condor pool, use condor_status.
  5. To remove a job before it has finished executing, use condor_rm.
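The steps above can be sketched as a short shell session. This is only an outline: the submit file name myprogram.submit is a hypothetical placeholder, and the run helper with its DRY_RUN switch is an illustrative addition (not part of Condor) that prints each command instead of executing it, so the sketch can be tried on a machine without Condor.

```shell
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1 (optional): relink the program so it can be checkpointed.
run condor_compile gcc -O -o myprogram.condor file1.c file2.c
# Step 2: submit the job (myprogram.submit is a placeholder name).
run condor_submit myprogram.submit
# Step 3: watch the job in the queue.
run condor_q
# Step 4: check the state of the pool.
run condor_status
# Step 5, only if needed: remove a job by its ID, e.g. condor_rm 1234.8
```

Setting DRY_RUN=0 before running the script executes the commands for real.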

condor_compile

Linking a program with Condor's libraries allows it to be checkpointed, which dumps an image of the running process to a special storage area on the cluster servers. After that it can be migrated to a different machine and restarted where it left off. In order to do this, when compiling the program, add "condor_compile" to the beginning of the command line that links the program. For example:

condor_compile gcc -O -o myprogram.condor file1.c file2.c ...

If you do not link the program with Condor's libraries, the job can still be submitted to the pool and successfully executed, but it cannot be checkpointed. Instead, if the job is interrupted by a machine failure, by the owner becoming active on the machine console, or by a higher-priority job, it will be suspended for a while. If the machine does not become available again before the timeout period expires, the job will be killed and restarted from the beginning elsewhere.

condor_submit

condor_submit uses a submit description file which tells Condor what executable to use, which directory to run the program in, what command line arguments to pass to the program, and so forth. Submit descriptions can be rather elaborate, specifying what sort of machine is required by the job in great detail, or they can be as simple as the following example from the User's Manual.

####################################
#
# Example 1
# Simple condor job description file
#
####################################

Executable = foo
Log = foo.log
Queue

When a job is submitted with a simple submit description like this one, Condor assumes that the job should run on a machine of the same architecture as the submitting machine and uses /dev/null for stdin, stdout, and stderr.
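Because anything sent to /dev/null is discarded, it is usually worth naming the output files explicitly. The sketch below writes a variant of Example 1 that does so. The file names foo.submit, foo.out, and foo.err are placeholders chosen for illustration, and the local +Group, +Project, and +ProjectDescription lines that UTCS requires are left out here for brevity.

```shell
# Write a submit description that keeps stdout and stderr.
# foo.submit, foo.out, and foo.err are illustrative names.
cat > foo.submit <<'EOF'
Executable = foo
Output     = foo.out
Error      = foo.err
Log        = foo.log
Queue
EOF

# The file would then be submitted with: condor_submit foo.submit
```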

###################################################
#
# Example 2: Show off some fancy features including
# use of pre-defined macros and logging.
#
###################################################

Executable = foo
Requirements = Memory >= 32 && OpSys == "SOLARIS29" && Arch == "SUN4u"
Rank = Memory >= 64

Image_Size = 28 Meg

Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = foo.log

Queue 150

The above description file will submit 150 runs of program foo to the Condor pool. The program will run only on UltraSparc machines running Solaris 9 and having at least 32MB of RAM. It prefers to run on machines with at least 64MB RAM, if available. Each instance of the program will use up to 28MB of RAM while running, and each will have its own input, output, and error file with the process number appended to it (err.0, in.0, out.0 through err.149, in.149, out.149).
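Since every run reads its own input file, in.0 through in.149 should exist in the submit directory before the job is submitted. A minimal sketch of preparing them follows. The contents written here (one number per file) are purely hypothetical and stand in for whatever input foo actually expects.

```shell
# Create one input file per queued process: in.0 ... in.149.
# The file contents are a placeholder for foo's real input.
count=150
i=0
while [ "$i" -lt "$count" ]; do
  echo "$i" > "in.$i"
  i=$((i + 1))
done
```

The value of count should match the number given on the Queue line.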

################################################
#
# Example 3
# Control email notifications about Condor jobs.
#
################################################

Executable = foo
Log = foo.log

Notification = Complete
Queue 50000

The above description file will submit 50,000 jobs and will send email to the submitter upon job completion or failure. Notification options are Always, Complete, Error (our default), and Never. A setting of Always will cause email to be sent every time the job checkpoints, in addition to sending one upon error or completion.

UTCS-specific Information

In addition to the features described in the Condor documentation, predicates have been added locally to allow you to specify that the job must run on cluster nodes, and others which allow us to better track how the cluster is being used.

To force your job to run on the cluster nodes, add a "requirements" line to the submit description file like:

requirements = InMastodon

We have two other local variables defined which we use for tracking the way our clusters are used.

Each submit description file should define the user's group with one of the following:

+Group = "PROF"
+Group = "GRAD"
+Group = "UNDER"
+Group = "UTGRID"
+Group = "GUEST"

These indicate that the submitter is a CS professor, a CS graduate student, a CS undergraduate, a UTGRID partner, or a guest account, respectively. For special high-priority jobs, please email gripe@cs for more information.

Each submit description file must also define the UTCS research area to which the job relates with one of the following.

+Project = "ARCHITECTURE"
+Project = "FORMAL_METHODS"
+Project = "AI_ROBOTICS"
+Project = "OPERATING_DISTRIBUTED_SYSTEMS"
+Project = "NETWORKING_MULTIMEDIA"
+Project = "PROGRAMMING_LANGUAGES"
+Project = "THEORY"
+Project = "GRAPHICS_VISUALIZATION"

+Project = "COMPONENT_BASED_SOFTWARE"
+Project = "SCIENTIFIC_COMPUTING"
+Project = "COMPUTATIONAL_BIOLOGY"
+Project = "INSTRUCTIONAL"
+Project = "UTGRID"
+Project = "OTHER"

In addition, each submit file must define a project description field which tells a little more about what you are working on. This is a free-form text field which cannot be blank.

+ProjectDescription = "simulation of population growth of starship tribble colonies"

NOTE: Jobs submitted without valid project codes or without some sort of project description will sit idle in the queue. You will not receive a notice explaining that your job cannot run.
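Because such jobs fail silently, it can help to check a submit description file before submitting it. The function below is a minimal sketch of such a check. check_submit is a hypothetical helper, not a Condor command. It only verifies that the three local attributes are present with a non-empty quoted value; it does not compare the values against the lists of valid groups and projects.

```shell
# check_submit FILE: report any locally required attribute that is
# missing from (or empty in) a submit description file.
# check_submit is a hypothetical helper, not part of Condor.
check_submit() {
  file=$1
  missing=0
  for attr in Group Project ProjectDescription; do
    # Match lines such as: +Project = "THEORY"
    if ! grep -Eiq "^[[:space:]]*\+${attr}[[:space:]]*=[[:space:]]*\"[^\"]+\"" "$file"; then
      echo "missing or empty: +$attr" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be check_submit chipsim.submit && condor_submit chipsim.submit, where chipsim.submit is a placeholder file name; the job is submitted only if the check passes.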

Job Management Tips

Include "error" and "log" path commands in your job description file. If a job mysteriously fails to run, the error and log files may contain some clue why.

#########################################
#
# Example 4: Show off some fancy features
# and local predicates.
#
#########################################

+Group = "GRAD"
+Project = "ARCHITECTURE"

+ProjectDescription = "simulating the use of dilithium substrate"

Executable = chipsim
Requirements = Memory >= 2000 && InMastodon
Rank = Memory >= 4000
Image_Size = 1900 Meg

Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = chipsim.log

Queue

For more information and example submit files, see the condor_submit manual page and the Submitting a Job section of the User's Manual.

condor_q

condor_q displays information about the jobs in the queue. Some useful examples follow.

  • condor_q - With no parameters specified, condor_q returns a list of jobs submitted from the current machine.
  • condor_q -g - This will return all jobs in the Condor pool.
  • condor_q -g -submitter jdoe - This will return all jobs in the queue submitted by user "jdoe".
  • condor_q -g -run - This will return all jobs currently running.
  • condor_q -g -better-analyze 2248.9 - This will give a detailed analysis of why job 2248.9 is not running.

For more information, see the condor_q manual page.

condor_status

condor_status provides status information for the cluster as a whole, for individual machines, or for individual virtual machines on a node. Queries can also be limited by using various constraints. See the following examples.

  • condor_status - This will give the overall status of the Condor pool.
  • condor_status uruk-12 - This will give the status of node uruk-12.
  • condor_status vm1@uruk-hai-2 - This will give the status of virtual machine 1 on node uruk-hai-2.
  • condor_status -constraint 'Memory>2000' - This will give the status of machines with memory greater than 2000MB.
  • condor_status -constraint 'InMastodon' - This will give the status of machines in the Mastodon cluster -- it excludes desktops.

For more information, see the condor_status manual page.

condor_rm

condor_rm is used to delete one or more jobs from the queue. Users can remove only their own jobs.

  • condor_rm 1234.8 - This will remove job 1234.8 from the scheduler on the current machine.
  • condor_rm 1234 - This will delete all jobs in job cluster 1234.
  • condor_rm jdoe -constraint 'Activity != "Busy"' - This will remove all inactive jobs in the Condor pool owned by user "jdoe", assuming you are jdoe or a queue administrator. The single quotes keep the shell from interpreting the "!" and stripping the double quotes.
  • condor_rm -forcex 1234.34 - This will force deletion of job 1234.34 after the normal condor_rm has failed to kill it. Use this when you have jobs stuck in the "X" state for a long time.

For more information, see the condor_rm manual page.

Java and other interpreted languages

The interpreters for languages such as Perl, Python, and Java are not linked with the Condor libraries. As a result, any jobs written in these languages need to be run in the vanilla universe or in the java universe. Add the following line to the submit-description file:

universe = vanilla

or for Java jobs:

universe = java

Java (and several other programs, such as /lusr/bin/matlab, that have shell script wrappers) sometimes has problems when run under Condor. For Java, you may get the error "Error: can't find libjava.so." To work around these errors, create a shell script named something like run_my_java_prog.sh which calls your executable with the appropriate environment, for example:

#!/bin/sh
#
# shell script to execute java jobs in vanilla universe
#
export CLASSPATH=/projects/whatever/java:.
exec /usr/bin/java -arg1 arg2 ...

Then specify /bin/sh as the executable and use this script as the argument in a submit-description file containing the following:

executable = /bin/sh
arguments = run_my_java_prog.sh

(Note that information on running Java jobs in the vanilla universe is provided for historical reasons. The java universe is now functional, so these wrapper scripts are really more appropriate for Matlab, Python, Perl, and other jobs, though they should still work for Java in the vanilla universe. A java universe example follows at the end of this section.)

In case you'd like to keep a unique identifier on your jobs in the queue that are called this way, you can add an argument which your shell script disregards (which it will do by default unless you access $1 or $@ or the like in the script), by appending it to the argument line:

arguments = run_my_java_prog.sh param1=10000 these_are_ignored

and change the submit-description file for each submission.
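Rather than editing the file by hand for each submission, the submit description files can be generated. The loop below is a sketch of that idea. The parameter values, the run_<value>.submit and run_<value>.log file names, and the choice of one file per value are all illustrative assumptions, and the +Group, +Project, and +ProjectDescription lines required at UTCS are omitted for brevity.

```shell
# Generate one submit description per parameter value.
# The values and file names here are illustrative.
for n in 1000 10000 100000; do
  cat > "run_$n.submit" <<EOF
universe   = vanilla
executable = /bin/sh
arguments  = run_my_java_prog.sh param1=$n these_are_ignored
log        = run_$n.log
queue
EOF
done

# Each file would then be submitted with: condor_submit run_<value>.submit
```

The extra arguments then show up in the queue listing, which makes it easier to tell the jobs apart.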

If you do wish to use an interpreter such as Perl directly (which works for at least trivial jobs), your executable must be the interpreter itself, with the script and any other arguments passed in the arguments line of the submit-description file, i.e.:

executable = /usr/bin/perl
arguments = -w script.pl arg1 arg2


Running a Java program in the java universe:

universe = java
executable = javaprog.class
# Main class and other args
arguments = javaprog arabica
output = javaprog.output.$(Process)
error = javaprog.error.$(Process)

or with a JAR file:

universe = java
executable = CoffeeTest.jar

# Main class and other args
arguments = CoffeeTest robusta

jar_files = CoffeeTest.jar