A short writeup on using the Mastodon cluster with condor

The general setup

There's a cluster of machines called the Mastodon cluster; the individual machines are described in a bit of detail here.

The workflow management/queueing software used is condor (HTCondor, out of UW-Madison). Its documentation can be found here, though it's a bit...rough.

The UTCS condor setup also encompasses some of the public lab machines, in addition to the dedicated Mastodon machines. So you can run condor jobs on slack resources in the public labs if that makes sense for you.

The home directory space in /u/username is hecka RAIDed and backed up and snapshotted regularly. There's also a larger, wild-west-style bunch of scratch space that only the Mastodon machines can see; from one of those machines you can get to it at

  cd /scratch/cluster/[username] 
with [username] your UTCS username. Open a helpreq to get scratch space allocated.

Running jobs with condor

Submitting a job through condor requires a condor_submit config file with some info about the job and its requirements. The scheduler then assigns the job a slot, consisting of a CPU, some RAM, and, optionally, a GPU.
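
For reference, a minimal condor_submit file looks something like this (the file name, paths, and resource numbers are made up, and UTCS may want some extra site-specific attributes on top of the stock ones):

  # my_job.submit -- a hypothetical minimal job description
  universe       = vanilla
  executable     = run_experiment.sh
  arguments      = --seed 42
  output         = /scratch/cluster/username/logs/job.$(Cluster).out
  error          = /scratch/cluster/username/logs/job.$(Cluster).err
  log            = /scratch/cluster/username/logs/job.$(Cluster).log
  request_cpus   = 1
  request_memory = 4GB
  queue

and you kick it off with

  condor_submit my_job.submit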

Information about submitting via condor can be found by scrolling down a bit on this page. There's a condor_compile to build your stuff so it can recover if the job is preempted or stopped, but it's sort of a pain, and it's generally easier (in one humble student's opinion) to just have your stuff checkpoint every once in a while and allow your things to start from a checkpoint, if your problem admits that (basically every NN framework does).
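
A minimal sketch of that checkpoint-and-resume pattern, as a shell wrapper around a training script (all the names and paths here are hypothetical; adapt to whatever your framework's checkpointing looks like):

  #!/bin/bash
  # run_experiment.sh -- resume from the newest checkpoint if one exists
  CKPT_DIR=/scratch/cluster/username/ckpts
  LATEST=$(ls -t "$CKPT_DIR"/*.ckpt 2>/dev/null | head -n 1)
  if [ -n "$LATEST" ]; then
      exec python train.py --resume "$LATEST"
  else
      exec python train.py
  fi

Since a preempted vanilla-universe job just gets restarted from the top, this makes it pick up from the last checkpoint instead of from zero.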

A bunch of students use variants of the condorizer script found on Stephen's github repo, a little python script that creates a condor_submit file, exposes options for the various configurable things, and calls condor_submit for you.

Submit/run machines

You can condor_submit from most machines, but in general the scheduler will only schedule 150 jobs from any one machine. The exception is streetpizza@cs, AKA submit64@cs, which is a dedicated submit node (so don't do computationally intensive stuff on it) from which the scheduler will schedule many more jobs at once. There may be another submit node or two hanging out these days.

There's also a dedicated high-prio machine whose dark and eldritch power you can wield if you have an important deadline coming up. helpreq can help you get temporary access to this if you have a legit need.

The machines narsil-1 through narsil-9 (I think? Maybe it goes higher) are configured to also allow direct ssh access, in addition to being Mastodon machines (most machines in the cluster can't be ssh'd into directly). These are useful for a bunch of things (e.g. spinning up a lightweight DB server for condor jobs to access, or running tensorboard to see TF logs in scratch), but if you do long-running CPU/RAM-intensive things on them, expect to anger others.
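
E.g., for the tensorboard case, something like this works (--logdir and --port are standard tensorboard flags; the log path and port number here are arbitrary):

  ssh narsil-1
  tensorboard --logdir /scratch/cluster/username/tf_logs --port 6006

and then point a browser (or an ssh tunnel) at narsil-1:6006.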

Using GPUs

The --gpu flag in the condorizer script linked to above adds what you need to request a GPU to your condor_submit script (at least as of right now---the syntactic requirements are UTCS-specific and have changed once or twice).
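
For reference, the added lines look something like this as of this writing (trust whatever condorizer currently emits over this doc; the UTCS-specific form below is an assumption based on the current setup, while request_GPUs is the stock HTCondor spelling):

  # UTCS-specific GPU request (historically):
  requirements = (TARGET.GPUSlot)
  +GPUJob = true

  # stock HTCondor equivalent:
  request_GPUs = 1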

The eldar-N machines are the GPU machines. To see the current slots and their availability, run

  condor_status | grep eldar

You can log directly into eldar-1 and eldar-11, but don't run long-running jobs on their GPUs; they're for compiling things for GPU use and testing that GPU things work. Sort of a jerk move to use these GPUs for long-running jobs.

GPU jobs count against your userprio more than CPU jobs---something like 8x more, I think.
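
You can check where your userprio currently stands (lower numbers are better) with

  condor_userprio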

The CUDA stuff is accessible from the eldar machines in, e.g.,

  /opt/cuda-8.0/
with a few other versions in analogous dirs (e.g. 7.0, 7.5). The lib64 and bin subdirectories are probably useful if you're building stuff yourself. As of the time of writing this doc (August 2017), you're on your own for downloading cuDNN and putting it on your LD_LIBRARY_PATH when running/building things that need it.
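
So a typical environment setup looks something like this (the cuDNN path is wherever you unpacked it, e.g. somewhere in scratch; that exact path is made up):

  export PATH=/opt/cuda-8.0/bin:$PATH
  export LD_LIBRARY_PATH=/opt/cuda-8.0/lib64:/scratch/cluster/username/cudnn/lib64:$LD_LIBRARY_PATH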

Miscellaneous useful commands
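
A few standard condor commands that tend to come in handy:

  condor_q                       # your queued/running jobs
  condor_q -better-analyze <id>  # why isn't my job matching/running?
  condor_rm <cluster_id>         # remove one job
  condor_rm <username>           # remove all of your jobs
  condor_status                  # all slots and their current states
  condor_userprio                # everyone's current priorities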

