General
1. How to make sure my jobs get scheduled quickly.
Don't request more resources than you need to run your job. For example,
if you know your job requires a certain amount of memory, specify it in
the job script so that the resource manager can schedule your job on an
appropriate node. It may also reduce queue wait time. Memory
requirements can be specified with the --mem=VALUE option in the job
script.
2. Always run jobs in the most appropriate queue.
For example, if you want to run a job that requires a lot of memory
(> 10GB), it is best to run it in one of the big memory queues. This can
be done by requesting one of the "bigmem" partitions in your job
submission script (e.g. --partition=bigmem-HiPri). Otherwise your job
may get killed, other users' jobs may be culled, or you could even cause
a kernel panic on a compute node.
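The two recommendations above can be combined in a single job script. The following is a minimal sketch, not an official ARGO template: the job name, the 32G memory value, and the time limit are illustrative assumptions, and only --mem and --partition=bigmem-HiPri come from the text above.

```shell
#!/bin/bash
#SBATCH --job-name=bigmem_example     # hypothetical job name
#SBATCH --partition=bigmem-HiPri      # big memory partition named above
#SBATCH --mem=32G                     # assumed value; request only what the job needs
#SBATCH --time=01:00:00               # assumed time limit

# SLURM exports these variables inside a job; the fallbacks apply only
# when the script is run by hand outside the scheduler.
job_id="${SLURM_JOB_ID:-local}"
partition="${SLURM_JOB_PARTITION:-none}"
echo "Job ${job_id} running in partition ${partition} on $(uname -n)"
```

Submitted with sbatch, a script like this lets the scheduler place the job on a node with enough free memory. For a job that needs less than 10GB, the --partition line would simply be dropped or changed to a regular partition.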
Resource Utilization
Using Multiple Nodes/Slots
1. Which MPI to use?
Unless your code requires a specific MPI library, it is highly
recommended that you use the latest Intel MPI installed on the ARGO
cluster. Also note that mpich does not use the high-speed InfiniBand
interconnect, so using it with a communication-heavy process could
create a bottleneck due to slow communications.
2. Number of nodes to use
Although the cluster has > 700 nodes, it is recommended that you restrict your MPI program to 128 nodes. Even with 128 nodes, it could take a while for your job to get scheduled, depending on the cluster load.
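As an illustration of a multi-node request, here is a rough sketch of an MPI job script. The node and task counts, the time limit, and the program name ./my_mpi_program are all hypothetical, and whether srun or mpirun is the right launcher depends on how MPI is set up on the cluster, so treat the launch line as an assumption to check against the ARGO documentation.

```shell
#!/bin/bash
#SBATCH --job-name=mpi_example        # hypothetical job name
#SBATCH --nodes=4                     # assumed value; stay well under 128
#SBATCH --ntasks-per-node=16          # assumed value
#SBATCH --time=01:00:00               # assumed time limit

# Hypothetical executable; replace with your own MPI program.
program=./my_mpi_program

# SLURM exports these inside a job; the defaults apply only when the
# script is run by hand outside the scheduler.
nodes="${SLURM_JOB_NUM_NODES:-4}"
tasks="${SLURM_NTASKS:-64}"
echo "Launching ${program} with ${tasks} tasks on ${nodes} nodes"

# Launch only inside a SLURM job and only if the program exists.
if [ -n "${SLURM_JOB_ID:-}" ] && [ -x "${program}" ]; then
    srun "${program}"
fi
```

Smaller node counts generally start sooner, so it can be worth testing with a few nodes before scaling up.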
GPU Based Jobs
ARGO hosts 4 GPU nodes with K80 GPUs (nodes 40, 50, 55 and 56).
GPU-accelerated programs, or software compiled with GPU support, must
be run on these 4 nodes. To request a GPU node, use the
--partition=gpuq and --gres=gpu: options
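A GPU request might look like the sketch below. The GPU count of 1 after --gres=gpu: is an assumption made for illustration, as are the job name and time limit; only the gpuq partition and the --gres option come from the text above.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_example        # hypothetical job name
#SBATCH --partition=gpuq              # GPU partition named above
#SBATCH --gres=gpu:1                  # assumed count of 1 GPU
#SBATCH --time=01:00:00               # assumed time limit

# When GPUs are allocated through --gres, SLURM normally sets
# CUDA_VISIBLE_DEVICES for the job; "none" is shown when run by hand.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${gpus}"
```

Printing the visible devices at the start of a job is a cheap way to confirm that the GPU was actually allocated before the real workload starts.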
Troubleshooting
1. Before contacting the system admin, look in the following places to see if you can find a solution to your problem:
- The ARGO Wiki
- Your ARGO User Guide
2. Follow the bread crumbs. To get information about a prematurely killed job, check the log file created by SLURM. To get a list of failed (F), completed (CD), and cancelled (CA) jobs with error codes, the following command can be used:
sacct -s F,CD,CA --starttime yyyy-mm-dd -u $USER | less
where the --starttime option sets the date from which job records are printed.
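For a concrete variant of the sacct command, the sketch below fills in an example start date and limits the output to a few columns with --format. The date is an arbitrary placeholder, and the column selection is just one reasonable choice, not an ARGO recommendation.

```shell
# Example start date (placeholder); any yyyy-mm-dd value works.
start_date="2024-01-01"

# Columns to print; these are standard sacct format fields.
fields="JobID,JobName,Partition,State,ExitCode,Elapsed"

# Run the query only where sacct is available (e.g. a login node).
if command -v sacct >/dev/null 2>&1; then
    sacct -s F,CD,CA --starttime "${start_date}" -u "${USER}" --format="${fields}"
else
    echo "sacct not found; run this on a cluster login node" >&2
fi
```

The State and ExitCode columns are usually the quickest way to tell a job that failed on its own from one that was cancelled.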