UTCS operates two general-purpose High Throughput Computing (HTC) clusters, Mastodon and Scout, which are managed using the Condor job scheduling software. In addition, we have created a "cluster" out of our desktop Linux and Solaris machines. These clusters are available to the faculty, staff, and students of the department for research and educational purposes. At this point, the clusters are used almost exclusively for running batch jobs, but we also have an MPI subcluster, taken from the Scout cluster, consisting of 64 processors and 16GB of RAM. Condor also supports parallel jobs using the PVM interface.
A subset of the Mastodon cluster has been set aside to allow direct logins for jobs which do not work under Condor. If you find that you cannot make your jobs work under Condor, please send email to gripe@cs, and we will either help you solve the Condor problem or arrange for login access to this group of semi-public machines. All users on these machines have to work together to avoid resource conflicts, as there is no provision for automatic resource management and no way for staff to sort out who should be using what machines at what time.
We have created a local Usenet newsgroup, utcs.clusters, for discussion of our clusters and of Condor. Please make use of this newsgroup, as it makes sharing experiences much easier.
From the Overview of the Condor Manual:
Instead of running a CPU-intensive job in the background on their own workstation, a user submits their job to Condor. Condor finds an available machine on the network and begins running the job on that machine. When Condor detects that a machine running a Condor job is no longer available (perhaps because the owner of the machine came back from lunch and started typing on the keyboard), Condor checkpoints the job and migrates it over the network to a different machine which would otherwise be idle. Condor restarts the job on the new machine to continue from precisely where it left off. If no machine on the network is currently available, then the job is stored in a queue on disk until a machine becomes available.
The UTCS Condor pool consists of two clusters of dedicated Linux x86 servers, various Linux x86 desktop machines, and Sun Solaris workstations. Most of the staff-supported machines in the department run some subset of the Condor daemons so that they can at least submit jobs and query the status of the pool. Many of them also run the job execution daemons so that they contribute to the overall computing capacity when they are not being used by their owners.
For more information on using Condor, see the Condor User's Manual section of the Condor Manual.
All of the Condor programs are in /lusr/condor/bin; add that directory to your PATH environment variable or use that path with the commands described below. To use condor_compile, you do need that directory in your path.
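As a minimal sketch, the following lines could go in a Bourne-style shell startup file to put the Condor directory on your PATH. Only the /lusr/condor/bin path comes from the text above; the variable name CONDOR_BIN and the duplicate check are our own illustration.

```shell
# Directory holding the Condor programs (from the text above).
CONDOR_BIN=/lusr/condor/bin

# Prepend it to PATH only if it is not already listed.
case ":$PATH:" in
  *":$CONDOR_BIN:"*) ;;                       # already present, nothing to do
  *) PATH="$CONDOR_BIN:$PATH"; export PATH ;; # add it at the front
esac
```

The case test keeps PATH from growing each time the startup file is read again.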
Linking a program with Condor's libraries allows it to be checkpointed, which dumps an image of the running process to a special storage area on the cluster servers. After that it can be migrated to a different machine and restarted where it left off. In order to do this, when compiling the program, add "condor_compile" to the beginning of the command line that links the program. For example:
condor_compile /lusr/bin/gcc -O -o myprogram.condor file1.c file2.c ...
If you do not link the program with Condor's libraries, the job can still be submitted to the pool and successfully executed, but it cannot be checkpointed. Instead, if the job is interrupted by a machine failure, by the owner becoming active on the machine console, or by a higher-priority job, it will be suspended for a while, and then if the machine does not become available before the timeout period, it will be killed and restarted from the beginning elsewhere.
NEW: Condor's libraries ARE now compatible with gcc-3. See the condor_compile manual page for details.
condor_submit uses a submit description file which tells Condor what executable to use, which directory to run the program in, what command line arguments to pass to the program, and so forth. Submit descriptions can be rather elaborate, specifying what sort of machine is required by the job in great detail, or they can be as simple as the following example from the User's Manual.
####################################
#
# Example 1
# Simple condor job description file
#
####################################
Executable = foo
Log = foo.log
Queue
When a job is submitted with a simple submit description like this one, Condor assumes that the job should be run on a machine of the same architecture as the submitting machine and uses /dev/null for stdin, stdout, and stderr.
###################################################
#
# Example 2: Show off some fancy features including
# use of pre-defined macros and logging.
#
###################################################
Executable = foo
Requirements = Memory >= 32 && OpSys == "SOLARIS29" && Arch == "SUN4u"
Rank = Memory >= 64
Image_Size = 28 Meg
Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = foo.log
Queue 150
The above description file will submit 150 runs of program foo to the Condor pool. The program will run only on UltraSparc machines running Solaris 9 and having at least 32MB of RAM. It prefers to run on machines with at least 64MB RAM, if available. Each instance of the program will use up to 28MB of RAM while running, and each will have its own input, output, and error file with the process number appended to it (err.0, in.0, out.0 through err.149, in.149, out.149).
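Because each run reads its own in.$(Process) file, the input files in.0 through in.149 presumably need to exist in the submit directory before the job is submitted. The loop below is a rough sketch of one way to create them; the RUNS variable and the placeholder contents are assumptions for illustration only.

```shell
# Number of runs requested by "Queue 150" in Example 2.
RUNS=150

i=0
while [ "$i" -lt "$RUNS" ]; do
  # Placeholder contents: a real job would write its actual input here.
  echo "run $i" > "in.$i"
  i=$((i + 1))
done
```

Process numbers start at 0, so the last file created is in.149, not in.150.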
When you submit a job, it is possible to specify requirements which must be met by the machine which will be executing the job. In addition to the requirements described in the Condor documentation, predicates have been added locally to allow you to specify that the job must or must not run on machines in one of the department clusters. To do this, add a "requirements" line to the submit description file like:
requirements = InPhylofarm
The local predicates that are available include InPhylofarm, InMastodon, and InPublic.
Unless you know that you need to limit your jobs to one of these groups of machines, you should not use any of these predicates. InPublic is still in place mainly for historical reasons; it will prevent your jobs from running on the cluster compute nodes. Limiting your jobs to one of the clusters is useful mainly if you need to access files on the cluster's shared filesystems.
We have two other local variables defined which we use for tracking the way our clusters are used.
Each submit description file should define the user's group with one of the following:
+Group = "PROF"
+Group = "GRAD"
+Group = "UNDER"
+Group = "UTGRID"
+Group = "GUEST"
These indicate that the submitter is a CS professor, a CS graduate student, a CS undergraduate, a UTGRID partner, or a guest account, respectively. For special high-priority jobs, please email gripe@cs for more information.
Each submit description file must also define the UTCS research area to which the job relates with one of the following:
+Project = "ARCHITECTURE"
+Project = "FORMAL_METHODS"
+Project = "AI_ROBOTICS"
+Project = "OPERATING_DISTRIBUTED_SYSTEMS"
+Project = "NETWORKING_MULTIMEDIA"
+Project = "PROGRAMMING_LANGUAGES"
+Project = "THEORY"
+Project = "GRAPHICS_VISUALIZATION"
+Project = "COMPONENT_BASED_SOFTWARE"
+Project = "SCIENTIFIC_COMPUTING"
+Project = "COMPUTATIONAL_BIOLOGY"
+Project = "INSTRUCTIONAL"
+Project = "UTGRID"
+Project = "OTHER"
In addition, each submit file must define a project description field which tells a little more about what you are working on. This is a free-form text field which cannot be blank.
+ProjectDescription = "simulation of population growth of starship tribble colonies"
NOTE: Jobs submitted without valid project codes or without some sort of project description will sit idle in the queue. You will not receive a notice explaining that your job cannot run.
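Since a job with missing attributes sits idle without any notice, it may be worth checking a submit description file before submitting it. The function below is a sketch of such a check. The name check_submit_file is hypothetical, and it only verifies that the three lines are present with non-empty quoted values; it does not verify that the group or project code is one of the valid ones listed above.

```shell
# check_submit_file FILE
# Returns 0 if FILE contains +Group, +Project, and +ProjectDescription
# lines with non-empty quoted values; otherwise reports what is missing
# on stderr and returns 1.
check_submit_file() {
  file=$1
  status=0
  for attr in Group Project ProjectDescription; do
    # Match e.g.:  +Group = "GRAD"
    if grep -Eq "^[[:space:]]*\+$attr[[:space:]]*=[[:space:]]*\"[^\"]+\"" "$file"; then
      :   # attribute found
    else
      echo "$file: missing or empty +$attr" >&2
      status=1
    fi
  done
  return $status
}
```

A possible use, with a hypothetical file name, is `check_submit_file chipsim.submit && condor_submit chipsim.submit`, so that nothing is submitted when the check fails.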
Include "error" and "log" path commands in your job description file. If a job mysteriously fails to run, the error and log files may contain some clue why.
#########################################
#
# Example 3: Show off some fancy features
# and local predicates.
#
#########################################
+Group = "GRAD"
+Project = "ARCHITECTURE"
+ProjectDescription = "simulating the use of dilithium substrate"
Executable = chipsim
Requirements = Memory >= 2000 && InMastodon
Rank = Memory >= 4000
Image_Size = 1900 Meg
Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = chipsim.log
Queue
For more information and example submit files, see the condor_submit manual page and the Submitting a Job section of the User's Manual.
condor_q displays information about the jobs in the queue. Some useful examples follow.
condor_q
With no parameters specified, condor_q returns a list of jobs submitted from the current machine.
condor_q -g
This will return all jobs in the Condor pool.
condor_q -g -submitter jdoe
This will return all jobs in the queue submitted by user "jdoe".
condor_q -g -run
This will return all jobs currently running.
condor_q -g -analyze 141.0
This will give an analysis of why job 141.0 is not running.
condor_q -g -better-analyze 2248.9
This will give a detailed analysis of why job 2248.9 is not running, or it may segfault. (It is a development version.)
For more information, see the condor_q manual page.
condor_status provides status information for the cluster as a whole, for individual machines, or for individual virtual machines on a node. Queries can also be limited by using various constraints. See the following examples.
condor_status
This will give the overall status of the Condor pool.
condor_status uruk-12
This will give the status of node uruk-12.
condor_status vm1@uruk-hai-2
This will give the status of virtual machine 1 on node uruk-hai-2.
condor_status -constraint 'Memory>2000'
This will give the status of machines with memory greater than 2000MB.
condor_status -constraint 'InMastodon'
This will give the status of machines in the Mastodon cluster.
For more information, see the condor_status manual page.
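Constraints are ClassAd expressions, so several conditions can be combined with && in the same way as in a Requirements line. The sketch below builds such a combined constraint; the helper name join_constraints is hypothetical, and the condor_status call is guarded so that the script does nothing on a machine without the Condor programs.

```shell
# join_constraints EXPR...
# Prints the expressions joined with " && ", each wrapped in parentheses.
join_constraints() {
  out=""
  for expr in "$@"; do
    if [ -z "$out" ]; then
      out="($expr)"
    else
      out="$out && ($expr)"
    fi
  done
  printf '%s\n' "$out"
}

constraint=$(join_constraints 'Memory>2000' 'InMastodon')
echo "$constraint"   # prints: (Memory>2000) && (InMastodon)

# Query the pool only where the Condor tools are installed.
if command -v condor_status >/dev/null 2>&1; then
  condor_status -constraint "$constraint"
fi
```

Wrapping each expression in parentheses keeps operator precedence from changing the meaning when the individual conditions themselves contain operators.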
condor_rm is used to delete one or more jobs from the queue. Users can remove only their own jobs.
condor_rm 1234.8
This will remove job 1234.8 from the scheduler on the current machine.
condor_rm 1234
This will delete all jobs in job cluster 1234.
condor_rm jdoe -constraint 'Activity != "Busy"'
This will remove all inactive jobs in the Condor pool owned by user "jdoe", assuming you are jdoe or a queue administrator.
condor_rm -forcex 1234.34
This will force deletion of job 1234.34 after the normal condor_rm has failed to kill it. Use this when you have jobs stuck in the "X" state for a long time.
For more information, see the condor_rm manual page.
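A constraint that compares against a quoted string, such as "Busy", needs care on the command line, because the shell removes unprotected double quotes before condor_rm ever sees them. The sketch below uses a hypothetical helper, show_arg, to display what a command would actually receive in each case.

```shell
# show_arg prints its first argument exactly as a command would receive it.
show_arg() {
  printf '%s\n' "$1"
}

# Unquoted: the shell removes the double quotes before the command runs,
# so Busy is no longer a string literal.
show_arg Activity!="Busy"      # prints: Activity!=Busy

# Single-quoted: the whole expression, inner quotes included, arrives intact.
show_arg 'Activity != "Busy"'  # prints: Activity != "Busy"
```

Wrapping the whole constraint in single quotes, as in the condor_status examples, is the simplest way to keep the inner double quotes.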
The interpreters for languages such as Perl, Python, and Java are not linked with the Condor libraries. As a result, any jobs written in these languages need to be run in the vanilla universe. Add the following line to the submit-description file:
universe = vanilla
Java, like several other programs that have shell script wrappers (such as /lusr/bin/matlab), sometimes has problems when run under Condor. For Java, you may get the error "Error: can't find libjava.so." To work around these errors, create a shell script named something like run_my_java_prog.sh which calls your executable with the appropriate environment, for example:
#!/bin/sh
#
# shell script to execute java jobs in vanilla universe
#
# set and export separately: the Solaris /bin/sh does not accept "export VAR=value"
CLASSPATH=/projects/whatever/java:.
export CLASSPATH
exec /lusr/java2/bin/java -arg1 arg2 ...
Then specify /bin/sh as the executable and use this script as the argument in a submit-description file containing the following:
executable = /bin/sh
arguments = run_my_java_prog.sh
To keep a unique identifier on jobs in the queue that are run this way, append an extra argument to the arguments line. The shell script ignores it by default, unless the script reads $1, $@, or the like:
arguments = run_my_java_prog.sh param1=10000 these_are_ignored
Then change the submit-description file for each submission.
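Rather than editing the file by hand each time, the submit-description file can be generated per submission. The function below is only a sketch of that idea: the name write_submit, the job_LABEL.submit file naming, and the group, project, and description values are all placeholders to be replaced with your own.

```shell
# write_submit LABEL
# Writes job_LABEL.submit, passing LABEL as an extra argument that the
# wrapper script ignores but that identifies the job.
write_submit() {
  label=$1
  cat > "job_$label.submit" <<EOF
# Group, project, and description values are placeholders; use your own.
+Group = "GRAD"
+Project = "OTHER"
+ProjectDescription = "java wrapper example"
universe = vanilla
executable = /bin/sh
arguments = run_my_java_prog.sh $label
output = out.$label
error = err.$label
log = job_$label.log
queue
EOF
}

write_submit run42
```

Each generated file could then be passed to condor_submit, for example `condor_submit job_run42.submit`.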
If you do wish to use an interpreter such as Perl directly (which does work for at least trivial jobs), the executable must be the interpreter itself, with the script and any other arguments passed in the arguments line of the submit-description file, i.e.:
executable = /lusr/bin/perl
arguments = -w script.pl arg1 arg2