General

1. How can I make sure my jobs get scheduled quickly?

Don't request more resources than you need to run your job. For example, if you know your job requires a certain amount of memory, specify it in the job script so that the resource manager can schedule your job on an appropriate node. This may also reduce queue wait time. Memory requirements can be specified using the --mem=VALUE option in the job script.
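As an illustration, a minimal job script that requests an explicit amount of memory might look like the sketch below. The job name, the 4G value, and the log file name are assumptions chosen for the example, not ARGO defaults; the commented-out program line is a hypothetical placeholder.

```shell
#!/bin/bash
#SBATCH --job-name=mem_example        # hypothetical job name
#SBATCH --mem=4G                      # assumed value: request only what the job needs
#SBATCH --output=mem_example_%j.log   # %j expands to the job ID

# SLURM exports the per-node memory request (in MB) to the job environment.
# Outside of SLURM the variable is unset, so fall back to a label.
mem_mb="${SLURM_MEM_PER_NODE:-unset}"
echo "Memory requested per node (MB): ${mem_mb}"

# Replace the next line with your real command (hypothetical placeholder).
# ./my_program input.dat
```

The script would be submitted with sbatch; the echo line simply records the request in the job log so it can be compared with actual usage later.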

2. Always run jobs in the most appropriate queue.

For example, if you want to run a job that requires a lot of memory (> 10 GB), it is best to run it in one of the big memory queues. This can be done by requesting one of the "bigmem" partitions in your job submission script (e.g. --partition=bigmem-HiPri). Otherwise your job may get killed, other users' jobs may be culled, or you could even cause a kernel panic on a compute node.
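A sketch of a job script header that targets a big memory partition is shown here. The partition name is the one quoted above; the job name and the 64G memory request are assumed values for illustration only.

```shell
#!/bin/bash
#SBATCH --job-name=bigmem_example   # hypothetical job name
#SBATCH --partition=bigmem-HiPri    # partition name taken from the text above
#SBATCH --mem=64G                   # assumed value, well above the 10 GB threshold

# SLURM sets SLURM_JOB_PARTITION inside a running job.
# Outside of SLURM the variable is unset, so fall back to a label.
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "Running in partition: ${partition}"
```

Printing the partition at the start of the job makes it easy to confirm from the log that the job ran where intended.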

Resource Utilization

Using Multiple Nodes/Slots

1. Which MPI to use?

Unless your code requires a specific MPI library, it is highly recommended that you use the latest Intel MPI installed on the ARGO cluster. Also note that the MPICH installation does not use the high-speed InfiniBand interconnect, so using it with a communication-heavy program could create a bottleneck due to slow communication.

2. Number of nodes to use

Although the cluster has > 700 nodes, it is recommended that you restrict your MPI program to 128 nodes. Even with 128 nodes, it could take a while for your job to get scheduled, depending on the cluster load.
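The node recommendation can be sketched in a job script as follows. The node and task counts are assumed values, the job name is hypothetical, and the launch line is left commented out because the exact MPI launcher and program depend on your setup.

```shell
#!/bin/bash
#SBATCH --job-name=mpi_example     # hypothetical job name
#SBATCH --nodes=4                  # assumed value, well under the 128-node guideline
#SBATCH --ntasks-per-node=16       # assumed value

# Inside a job SLURM sets these variables; the defaults mirror the
# #SBATCH values above so the script also runs outside of SLURM.
nodes="${SLURM_JOB_NUM_NODES:-4}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-16}"
total_tasks=$(( nodes * tasks_per_node ))

# Warn if the request exceeds the recommended limit for MPI programs.
max_nodes=128
if [ "${nodes}" -gt "${max_nodes}" ]; then
    echo "Warning: ${nodes} nodes exceeds the recommended ${max_nodes}" >&2
fi

echo "Launching ${total_tasks} MPI ranks on ${nodes} nodes"

# Hypothetical launch line; replace with your launcher and program.
# mpirun -np "${total_tasks}" ./my_mpi_program
```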

GPU Based Jobs

ARGO hosts 4 GPU nodes with K80 GPUs (nodes 40, 50, 55 and 56). GPU-accelerated programs, or software that was compiled with GPU support, must be run on these 4 nodes. To request a GPU node, use the --partition=gpuq and --gres=gpu: options in your job submission script.
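A minimal sketch of a GPU job script follows. The count after gpu: is an assumed value (one GPU) chosen for the example, and the job name is hypothetical; adjust both to your needs.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_example   # hypothetical job name
#SBATCH --partition=gpuq         # GPU partition named in the text above
#SBATCH --gres=gpu:1             # assumed count: one GPU

# When GPUs are allocated through --gres, SLURM typically sets
# CUDA_VISIBLE_DEVICES; outside of SLURM fall back to a label.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"

# Show GPU status if the NVIDIA tool is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed to query the GPU" >&2
fi
```

Logging the visible devices at the start of the job is a quick way to confirm that the job actually landed on a GPU node.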

Troubleshooting

1. Before contacting the system admin, look in the following places to see if you can find a solution to your problem:

The ARGO Wiki
Your ARGO User Guide

2. Follow the bread crumbs. To get information about a prematurely killed job, check the log file created by SLURM. To get a list of failed (F), completed (CD) and cancelled (CA) jobs with error codes, use the following command:

sacct -s F,CD,CA --starttime yyyy-mm-dd -u $USER | less

where the --starttime option sets the date from which job records are printed.
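As a concrete variation, the sketch below lists the same job states for the last seven days. The seven-day window and the chosen output columns are assumptions for the example, and the date arithmetic relies on GNU date (standard on Linux).

```shell
#!/bin/bash
# Compute the start date (7 days ago) in the yyyy-mm-dd form sacct expects.
# Assumes GNU date, which supports the -d option.
start_date="$(date -d "7 days ago" +%Y-%m-%d)"
echo "Listing jobs since ${start_date}"

# Only query the accounting database where sacct is available.
if command -v sacct >/dev/null 2>&1; then
    sacct -s F,CD,CA --starttime "${start_date}" -u "${USER:-$(whoami)}" \
          --format=JobID,JobName,State,ExitCode,Elapsed
else
    echo "sacct not found; run this on a cluster login node" >&2
fi
```

The ExitCode column is usually the quickest pointer to why a job ended early, and the JobID can then be matched against the SLURM log file for that job.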