There's a cluster of machines called the Mastodon cluster, whose machines are described in a bit of detail here.
The workflow management/queueing software used is condor (out of UW Madison). Its documentation can be found here, though it's a bit...rough.
The UTCS condor setup also encompasses some of the more public lab machines, in addition to the dedicated Mastodon machines, so you can run condor jobs on slack resources in the public labs if that makes sense for your workload.
The home directory space in /u/username is hecka RAIDed and backed up and snapshotted regularly. There's a larger wild-west-style bunch of scratch space that only the Mastodon machines can see; from one of those machines you can get to it with

cd /scratch/cluster/[username]

with [username] your UTCS username. Open a helpreq to get scratch space allocated.
Submitting a job through condor requires a condor_submit description file with some info about the job and its requirements. The scheduler will then assign the job a slot, consisting of a CPU, some RAM, and, optionally, a GPU.
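A minimal submit file looks something like this (the paths and filenames here are made up for illustration; see the condor docs for the full set of fields):

```
universe = vanilla
Executable = /u/yourname/experiments/run.sh
Arguments = --epochs 10
Output = /scratch/cluster/yourname/job0/out
Error = /scratch/cluster/yourname/job0/err
Log = /scratch/cluster/yourname/job0/log
request_memory = 4096
Queue
```

Then condor_submit job.submit from a submit-capable machine.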
Information about submitting via condor can be found by scrolling down a bit on this page. There's a condor_compile that builds your executable so it can recover if the job is preempted or stopped, but it's sort of a pain. It's generally easier (in one humble student's opinion) to just have your stuff checkpoint every once in a while and start from the latest checkpoint, if your problem admits that (basically every NN framework does).
A bunch of students use variants of the condorizer script found in Stephen's github repo: a little python script that creates a condor_submit file, exposes options for the various configurable things, and calls condor_submit.
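The gist of such a wrapper is small enough to sketch. This is NOT Stephen's actual script---the function and field names here are illustrative---but it shows the shape: build a submit description file, write it out, shell out to condor_submit.

```python
# Sketch of a condorizer-style wrapper (illustrative, not the real script).
import subprocess


def build_submit_file(executable, arguments, job_dir, gpu=False):
    """Return the text of a condor_submit description file."""
    lines = [
        "universe = vanilla",
        f"Executable = {executable}",
        f"Arguments = {arguments}",
        f"Output = {job_dir}/out",
        f"Error = {job_dir}/err",
        f"Log = {job_dir}/log",
    ]
    if gpu:
        # Stock-HTCondor GPU request; the UTCS-specific syntax may differ.
        lines.append("request_GPUs = 1")
    lines.append("Queue")
    return "\n".join(lines) + "\n"


def condorize(executable, arguments, job_dir, gpu=False):
    """Write the submit file and hand it to condor_submit."""
    submit_path = f"{job_dir}/job.submit"
    with open(submit_path, "w") as f:
        f.write(build_submit_file(executable, arguments, job_dir, gpu))
    subprocess.run(["condor_submit", submit_path], check=True)
```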
You can condor_submit from most machines, but in general the scheduler will only schedule 150 jobs from any one machine. The exception is streetpizza@cs, AKA submit64@cs, which is a dedicated submit node (so don't do computationally intensive stuff on it) from which the scheduler will schedule many more jobs at once. There may be another submit node or 2 hanging out these days.
There's also a dedicated high-prio machine whose dark and eldritch power you can wield if you have an important deadline coming up. helpreq can help you get temporary access to this if you have a legit need.
The machines narsil-1 through narsil-9 (I think? Maybe it goes higher) also allow direct ssh access, in addition to being Mastodon machines (most machines in the cluster can't be ssh'd into directly). These are useful for a bunch of things (e.g. spinning up a lightweight DB server for condor jobs to access, or running tensorboard to see TF logs in scratch), but if you run long-running CPU/RAM-intensive things on them, expect to anger others.
The --gpu flag in the condorizer script linked to above adds what you need to request a GPU to your condor_submit script (at least as of right now---the syntactic requirements are UTCS-specific and have changed once or twice).
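For reference, stock HTCondor's way of asking for a GPU is a request_GPUs line in the submit file; the UTCS-specific incantation may differ (and has changed before), so trust whatever condorizer currently emits over this:

```
request_GPUs = 1
```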
The eldar-N machines are the GPU machines. To see the current slots and their availability run
condor_status | grep eldar
You can log directly into eldar-1 and eldar-11, but don't run long-running jobs on their GPUs---they're for compiling things for GPU use and testing that GPU things work. Sort of a jerk move to use these GPUs for long-running jobs.
GPU jobs count against your userprio more than CPU jobs---something like 8x more, I think.
The CUDA stuff is accessible from the eldar machines in, e.g.,

/opt/cuda-8.0/

with a few other versions in analogous dirs (e.g. 7.0, 7.5). The lib64 and bin subdirectories are probably useful if you're building stuff yourself. As of the time of writing this doc (August 2017), you're on your own with downloading your own cuDNN stuff and putting it on your LD_LIBRARY_PATH when running/building things that need it.
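Concretely, that means something like the following in your job script or shell rc. The cuDNN location is an assumption---point it wherever you actually unpacked your own copy:

```shell
# Put the CUDA 8.0 toolchain on your paths; the cuDNN dir is hypothetical.
export PATH="/opt/cuda-8.0/bin:$PATH"
export LD_LIBRARY_PATH="/opt/cuda-8.0/lib64:$HOME/cudnn/lib64:$LD_LIBRARY_PATH"
```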
To check on your jobs, run

condor_q -g [filter]

with [filter] any number of possible filters---if you pass your username (condor_q -g myname) you'll get info on just your jobs. The -g flag means "global": show jobs submitted from any machine, not just the one you're currently on.
To see why a job isn't being scheduled, run

condor_q -g -better-analyze [jobid]

with [jobid] the job number; it'll output a whole bunch of stuff, in particular the job's requirement predicates and the number of machines satisfying their various intersections.
To kill a job, run

condor_rm [jobid]

with [jobid] the job ID (the ID in the first column of condor_q). You have to issue the command from the machine you submitted the job from. To kill all of your jobs issued from that machine, issue

condor_rm [username]
To see user priorities, run

condor_userprio

Lower is better. The scheduler isn't a strict pqueue (not totally sure exactly what it does), but the userprio matters. There are some more flags, in particular -allusers, which shows all users, not just the ones who've run stuff in the last day.
To dump everything condor knows about a job (its full ClassAd), run

condor_q -g -long [jobid]

To see which slot every running job is running on, run

condor_q -g -run
To get a log of job events (submission, execution, eviction, completion), add

Log = [logfile]

to your condor_submit script (or modify the version of condorizer you're using to generate it, or whatever)---this is analogous to the Error and Output fields in that script.